In Section~\ref{sec:impl_details}, we provide further implementation details for SUGAR pre-training and downstream adaptation. Section~\ref{sec:add_results} then presents additional quantitative results. In Section~\ref{sec:real_robot_expr}, we perform real-robot experiments to demonstrate the effectiveness of SUGAR pre-training for robotic manipulation in the real world. Finally, we discuss limitations and future work in Section~\ref{sec:limitation}.

\section{Implementation Details}\label{sec:impl_details}

\subsection{Pre-training}

\nbf{Network details} We set the number of points $N=4096$, the number of key points $N_{e}=256$, and the group size $S_{e}=32$ to obtain the point cloud input tokens. The SUGAR encoder and decoder contain $L=12$ transformer blocks with hidden size $d=384$ and 6 attention heads per block.
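The tokenization above can be sketched as follows. This is a minimal illustration only: it assumes that key points are selected by farthest point sampling and that each group gathers the nearest neighbors of its key point, a common choice for point cloud transformers that the text here does not confirm. The helper names `farthest_point_sampling` and `group_points` are hypothetical, and the demo uses a much smaller cloud than $N=4096$ so that pure Python stays fast.

```python
import random


def squared_dist(p, q):
    """Squared Euclidean distance between two 3D points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def farthest_point_sampling(points, num_keypoints, start=0):
    """Greedily pick indices of points that are far apart (assumed sampler)."""
    chosen = [start]
    # Distance from every point to its closest already-chosen key point.
    min_dist = [squared_dist(p, points[start]) for p in points]
    while len(chosen) < num_keypoints:
        nxt = max(range(len(points)), key=min_dist.__getitem__)
        chosen.append(nxt)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], squared_dist(p, points[nxt]))
    return chosen


def group_points(points, keypoint_ids, group_size):
    """For each key point, gather the indices of its nearest points (kNN)."""
    groups = []
    for k in keypoint_ids:
        order = sorted(range(len(points)),
                       key=lambda i: squared_dist(points[i], points[k]))
        groups.append(order[:group_size])
    return groups


# Toy sizes for speed; the setting in the text is N=4096, N_e=256, S_e=32.
rng = random.Random(0)
cloud = [(rng.random(), rng.random(), rng.random()) for _ in range(512)]
key_ids = farthest_point_sampling(cloud, num_keypoints=32)
token_groups = group_points(cloud, key_ids, group_size=8)
print(len(token_groups), len(token_groups[0]))
```

Each group would then be embedded into one input token, so the transformer sees $N_{e}$ tokens rather than $N$ raw points.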

\nbf{Training details} We pre-train two sets of models that differ in their pre-training data: `SN' uses only objects from ShapeNet, and `Ens' uses the ensemble of four datasets. For the `SN' model, we train for 100K iterations on the single-object dataset with a learning rate of 1e-4 and for 100K iterations on the multi-object dataset with a learning rate of 1e-5 and a batch size of 128. For the `Ens' model, we train for 300K iterations on the single-object dataset and for 200K iterations on the multi-object dataset, using the same learning rates and batch size as for the `SN' model. Pre-training is performed on one NVIDIA A100 GPU and takes 50 hours for the `SN' model and 130 hours for the `Ens' model.
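As a back-of-the-envelope check on the schedules above, the short sketch below derives the average wall-clock time per iteration from the stated iteration counts and GPU hours. The dictionary layout and the function name `seconds_per_iteration` are illustrative choices, not part of the released code.

```python
# Pre-training schedules as stated in the text (iterations per stage).
SCHEDULES = {
    "SN": {"single_object": 100_000, "multi_object": 100_000, "gpu_hours": 50},
    "Ens": {"single_object": 300_000, "multi_object": 200_000, "gpu_hours": 130},
}


def seconds_per_iteration(schedule):
    """Average wall-clock seconds per iteration over both stages."""
    total_iters = schedule["single_object"] + schedule["multi_object"]
    return schedule["gpu_hours"] * 3600 / total_iters


for name, schedule in SCHEDULES.items():
    print(name, round(seconds_per_iteration(schedule), 3))
```

Both schedules come out at a little under one second per iteration, which is consistent with the two models sharing the same architecture and batch size.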

\subsection{Referring expression grounding}

For the OCID-Ref dataset, we freeze the point cloud encoder and only finetune the prompt-based decoder. We finetune the model with a batch size of 64 and a learning rate of 1e-4 for 20 epochs. For the RoboRefit dataset, we finetune the full model with a batch size of 16 and a learning rate of 4e-5 for 50 epochs. We use the AdamW optimizer with a cosine learning rate scheduler.
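For readers unfamiliar with cosine scheduling, the sketch below shows the shape of the learning rate curve for the two settings above. It is written under assumptions the text does not state: no warmup, decay to a minimum learning rate of zero, and one update of the rate per epoch. The function name `cosine_lr` and the `FINETUNE` dictionary are hypothetical.

```python
import math


def cosine_lr(step, total_steps, base_lr, min_lr=0.0):
    """Cosine-annealed learning rate at a given step (no warmup assumed)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Fine-tuning settings quoted in the text; min_lr=0 is an assumption.
FINETUNE = {
    "OCID-Ref": {"batch_size": 64, "base_lr": 1e-4, "epochs": 20},
    "RoboRefit": {"batch_size": 16, "base_lr": 4e-5, "epochs": 50},
}

for name, cfg in FINETUNE.items():
    lrs = [cosine_lr(e, cfg["epochs"], cfg["base_lr"])
           for e in range(cfg["epochs"] + 1)]
    # Start, midpoint, and end of the schedule.
    print(name, lrs[0], lrs[cfg["epochs"] // 2], lrs[-1])
```

The rate starts at the base value, passes through half of it at the midpoint, and reaches the minimum at the end of training.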

\subsection{Language-guided robotic manipulation}

\nbf{Experimental setup} Our experimental setup on the 10 RLBench~\cite{james2020rlbench} tasks is the same as in previous work~\cite{guhur2023hiveformer,chen2023polarnet}. Specifically, we use three cameras located on the left shoulder, right shoulder, and wrist of the robot, with known camera intrinsics and extrinsics. Each camera produces an RGB-D image with a resolution of $128 \times 128$ at every step. A merged point cloud is obtained from these views using the camera parameters. Following~\cite{chen2023polarnet}, we keep only the points inside the robot's workspace by using a fixed bounding box around the table. We use voxel downsampling with a 0.5\,cm grid size to uniformly downsample the point cloud. For robotic control, we use keysteps~\cite{liu2022autolambda,guhur2023hiveformer,chen2023polarnet}, i.e., key turning points in action trajectories where the gripper changes its openness state or the joint velocities are close to zero. The control policy predicts the position (3D), rotation (4D, represented as a quaternion), and openness state (1D) of the gripper for the next keystep. The default motion planner in RLBench is used to find a trajectory between two keysteps.
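The preprocessing steps described above can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the actual pipeline: a pinhole camera model is assumed for back-projection, voxel downsampling is assumed to average the points in each cell, and the velocity threshold `eps` for keystep detection is a hypothetical value. All function names are invented for this sketch.

```python
def backproject(depth, fx, fy, cx, cy):
    """Lift a depth image (list of rows, in metres) to camera-frame points."""
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z > 0:  # skip pixels without a valid depth reading
                points.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    return points


def to_world(points, rotation, translation):
    """Apply camera-to-world extrinsics: p_world = R p_cam + t."""
    return [tuple(sum(r[j] * p[j] for j in range(3)) + t
                  for r, t in zip(rotation, translation))
            for p in points]


def crop_workspace(points, lower, upper):
    """Keep only points inside an axis-aligned bounding box."""
    return [p for p in points
            if all(lo <= c <= hi for c, lo, hi in zip(p, lower, upper))]


def voxel_downsample(points, voxel_size=0.005):
    """Average all points that fall in the same voxel (0.5 cm by default)."""
    buckets = {}
    for p in points:
        key = tuple(int(c // voxel_size) for c in p)
        buckets.setdefault(key, []).append(p)
    return [tuple(sum(axis) / len(group) for axis in zip(*group))
            for group in buckets.values()]


def find_keysteps(gripper_open, joint_speeds, eps=1e-3):
    """Steps where openness flips or all joint velocities are near zero."""
    keysteps = []
    for t in range(1, len(gripper_open)):
        flipped = gripper_open[t] != gripper_open[t - 1]
        stopped = all(abs(s) < eps for s in joint_speeds[t])
        if flipped or stopped:
            keysteps.append(t)
    return keysteps


# Tiny demo: one 2x2 depth image with an identity camera pose.
depth = [[1.0, 1.0], [1.0, 0.0]]
cam_points = backproject(depth, fx=1.0, fy=1.0, cx=0.5, cy=0.5)
identity = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
world = to_world(cam_points, identity, (0.0, 0.0, 0.0))
merged = voxel_downsample(crop_workspace(world, (-1, -1, 0), (1, 1, 2)))
steps = find_keysteps([1, 1, 0, 0, 0], [[0.2], [0.3], [0.1], [0.2], [0.0]])
print(len(merged), steps)
```

With several cameras, the world-frame points of each view would simply be concatenated before cropping and downsampling.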

\nbf{Model details} Figure~\ref{fig:rlbench_policy_architecture} illustrates the policy network in detail. We first combine the action prompt embedding $y^{L}_{1}$ and the point embeddings $\{x_{i}^{L}\}_{i=1}^{N_{e}}$ to compute a heatmap over all the key points, which denotes the importance of each key point for action prediction. We then use the heatmap to average the point embeddings and the key point positions, respectively. The averaged point embedding is concatenated with $y^{L}_{1}$ to regress the position offset relative to the averaged key point position, a rotation vector, and an openness state. The policy is trained by behavior cloning, with an MSE loss for position and rotation and a BCE loss for the openness state.
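The heatmap-based pooling and the training loss described above can be sketched as follows. Several details are assumptions made for illustration: the heatmap is computed as a softmax over dot products between the prompt embedding and each point embedding, the three loss terms are summed with equal weight, and the position offset is a fixed placeholder standing in for the MLP output. The function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def heatmap_pool(prompt_emb, point_embs, point_xyz):
    """Weight key points by similarity to the action prompt embedding."""
    scores = [sum(a * b for a, b in zip(prompt_emb, x)) for x in point_embs]
    heatmap = softmax(scores)  # assumed normalization
    pooled_emb = [sum(w * x[d] for w, x in zip(heatmap, point_embs))
                  for d in range(len(prompt_emb))]
    pooled_xyz = [sum(w * p[d] for w, p in zip(heatmap, point_xyz))
                  for d in range(3)]
    return heatmap, pooled_emb, pooled_xyz


def mse(pred, target):
    """Mean squared error between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def behavior_cloning_loss(pred_pos, gt_pos, pred_rot, gt_rot, open_prob, gt_open):
    """MSE on position and rotation plus BCE on the openness state."""
    p = min(max(open_prob, 1e-7), 1 - 1e-7)  # clamp so the log stays finite
    bce = -(gt_open * math.log(p) + (1 - gt_open) * math.log(1 - p))
    return mse(pred_pos, gt_pos) + mse(pred_rot, gt_rot) + bce


# Toy example with two key points and 2D embeddings.
prompt = [1.0, 0.0]
embs = [[4.0, 0.0], [0.0, 4.0]]
xyz = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)]
heatmap, pooled_emb, pooled_xyz = heatmap_pool(prompt, embs, xyz)
offset = (0.01, 0.0, -0.02)  # placeholder for the regressed offset
position = [c + o for c, o in zip(pooled_xyz, offset)]
print(heatmap, position)
```

Predicting an offset from the pooled key point position, rather than an absolute position, anchors the regression target to the part of the scene the heatmap attends to.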

\nbf{Training details} We train the model on the 10 RLBench tasks for 200K iterations with a batch size of 8. We adopt a learning rate of 5e-5 for the model trained from scratch and a lower learning rate of 2e-5 for the model initialized from SUGAR pre-training.

\begin{figure}
\centering
\includegraphics[width=1\linewidth]{figs/rlbench_policy_architecture.png}
\caption{Network architecture for language-guided robotic manipulation. The point cloud encoder and prompt-based decoder can be finetuned from SUGAR pre-training. We use two multi-layer perceptron (MLP) modules as the action prediction head.}
\label{fig:rlbench_policy_architecture}
\end{figure}

\section{Additional Results}\label{sec:add_results}

\nbf{Zero-shot object recognition} Although we consider the Ensembled w/o LVIS setup better suited for evaluating the generalization ability of models, we include results with LVIS training in Table 1 for a complete comparison with prior work~\cite{liu2023openshape}. Training with the LVIS split improves performance on the LVIS dataset but has little impact on the other two datasets. Our model still outperforms the SoTA method~\cite{liu2023openshape} under this setup.

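\begin{table}
\centering
\caption{Zero-shot object recognition performance with models trained on the Ensembled w/ LVIS dataset.}
\begin{tabular}{lccccccc}
\toprule
\multirow{2}{*}{Method} & \multirow{2}{*}{ModelNet40} & \multicolumn{3}{c}{ScanObjectNN} & \multicolumn{3}{c}{Objaverse-LVIS} \\
 & & OBJ\_ONLY & OBJ\_BG & PB\_T50\_RS & Top1 & Top3 & Top5 \\
\midrule
OpenShape~\cite{liu2023openshape} & 84.4 & 54.0 & 59.1 & 43.6 & 46.8 & 69.1 & 77.0 \\
\midrule
SUGAR (single) & 84.6 & 65.3 & 67.6 & 49.8 & 49.5 & 72.2 & 78.8 \\
SUGAR (multi) & 84.5 & 64.9 & 66.8 & 48.3 & 46.8 & 69.7 & 76.6 \\
\bottomrule
\end{tabular}
\end{table}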
\nbf{Referring expression grounding} We provide an additional variant, SUGAR (Ens$_s$), in Table 1, which is pre-trained on single objects of the ensembled dataset. Note that for SUGAR variants pre-trained on single objects we initialize only the point cloud encoder, as we find that initializing both the encoder and the decoder degrades performance. Since the decoder in single-object pre-training attends to the overall scene for cross-modal learning, we hypothesize that the learned cross-modal attention is poorly suited to recognizing local objects. As shown in Table 1, single-object pre-training on the ensembled dataset does not improve generalization to unseen cluttered scenes in the testB split, which demonstrates the importance of pre-training on multi-object scenes.

\begin{table}
\centering
\caption{Performance of referring expression detection (evaluated by {[email protected]}) and referring expression segmentation (evaluated by mIoU) on the RoboRefit dataset.}
\begin{tabular}{lcccc}
\toprule
\multirow{2}{*}{Method} & \multicolumn{2}{c}{testA} & \multicolumn{2}{c}{testB} \\
 & {[email protected]} & mIoU & {[email protected]} & mIoU \\
\midrule
SUGAR (no pre-train) & 87.56 & 81.31 & 55.62 & 57.02 \\
SUGAR (Ens$_s$) & 88.11 & 81.71 & 52.59 & 56.57 \\
SUGAR (Ens$_m$) & 89.47 & 82.11 & 65.04 & 62.80 \\
\bottomrule
\end{tabular}
\end{table}
\begin{table}
\centering
\caption{Averaged success rate of three runs for multi-task policies on 10 tasks of the RLBench simulator.}
\begin{tabular}{llccccccccccc}
\toprule
Method & Pre-train & Avg. & Pick \& Lift & Pick-Up Cup & Push Button & Put Knife & Put Money & Reach Target & Slide Block & Stack Wine & Take Money & Take Umbrella \\
\midrule
PolarNet~\cite{chen2023polarnet} & ShapeNetPart & $89.8_{\pm 1.5}$ & $97.8_{\pm 1.4}$ & $86.0_{\pm 2.1}$ & $99.6_{\pm 0.4}$ & $80.5_{\pm 1.1}$ & $94.1_{\pm 0.8}$ & $100_{\pm 0.0}$ & $93.4_{\pm 0.9}$ & $80.5_{\pm 3.6}$ & $68.1_{\pm 4.3}$ & $97.8_{\pm 0.2}$ \\
\midrule
\multirow{4}{*}{SUGAR} & - & $85.9_{\pm 3.9}$ & $77.7_{\pm 4.9}$ & $92.7_{\pm 4.2}$ & $91.7_{\pm 0.9}$ & $69.4_{\pm 8.0}$ & $87.7_{\pm 1.2}$ & $99.7_{\pm 0.4}$ & $94.3_{\pm 0.4}$ & $83.1_{\pm 7.8}$ & $66.8_{\pm 9.2}$ & $95.7_{\pm 1.6}$ \\
 & SN$_m$ & $93.0_{\pm 1.0}$ & $93.1_{\pm 1.3}$ & $94.5_{\pm 1.0}$ & $98.9_{\pm 0.8}$ & $85.4_{\pm 1.4}$ & $97.8_{\pm 1.3}$ & $100_{\pm 0.0}$ & $97.9_{\pm 0.8}$ & $94.5_{\pm 1.5}$ & $70.0_{\pm 1.6}$ & $98.4_{\pm 0.2}$ \\
 & Ens$_m$ w/o grasp & $92.0_{\pm 1.6}$ & $93.1_{\pm 1.3}$ & $93.7_{\pm 1.3}$ & $98.8_{\pm 1.1}$ & $85.5_{\pm 0.1}$ & $92.3_{\pm 5.3}$ & $99.9_{\pm 0.1}$ & $97.3_{\pm 1.4}$ & $93.7_{\pm 0.6}$ & $68.8_{\pm 4.2}$ & $97.2_{\pm 0.9}$ \\
 & Ens$_m$ & $93.0_{\pm 1.7}$ & $95.8_{\pm 1.3}$ & $95.7_{\pm 1.6}$ & $96.1_{\pm 5.1}$ & $86.5_{\pm 2.7}$ & $94.2_{\pm 1.6}$ & $100_{\pm 0.0}$ & $97.0_{\pm 0.5}$ & $93.5_{\pm 0.6}$ & $72.0_{\pm 2.9}$ & $98.8_{\pm 0.9}$ \\
\bottomrule
\end{tabular}
\end{table}
\nbf{Language-guided robotic manipulation} In Table 1, we report both the averaged success rates and the standard deviations for the RLBench 10-task experiment. As the 10 RLBench tasks use objects with simple shapes such as cups and cubes, pre-training on ShapeNet can be sufficient, and we therefore observe no further improvement from pre-training on Ens$_m$. Compared to PolarNet, our model performs slightly worse on Pick \& Lift and Push Button, although it achieves better performance on average. Note that PolarNet employs additional normal and height features in the point cloud, while our method omits them for generalizability in pre-training. As shown in PolarNet, normal and height features benefit tasks such as ``Push Button'', where the main failure case is that the gripper does not push the button down far enough. We also notice relatively large variations on individual tasks, and thus consider the averaged performance a more stable basis for comparison. In Figure 1, we present the performance on the RLBench validation split for policies trained from scratch and initialized from SUGAR pre-training. With pre-training, the policy converges much faster and achieves better performance.
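As a quick consistency check on the RLBench table, the sketch below recomputes the `Avg.` column as the mean of the ten per-task success rates. The numbers are copied from the table; the row labels and the 0.1 tolerance (to absorb rounding of the per-task values) are choices made for this illustration.

```python
# Per-task success rates (%) copied from the table, in column order.
PER_TASK = {
    "PolarNet": [97.8, 86.0, 99.6, 80.5, 94.1, 100.0, 93.4, 80.5, 68.1, 97.8],
    "SUGAR (scratch)": [77.7, 92.7, 91.7, 69.4, 87.7, 99.7, 94.3, 83.1, 66.8, 95.7],
    "SUGAR (SN_m)": [93.1, 94.5, 98.9, 85.4, 97.8, 100.0, 97.9, 94.5, 70.0, 98.4],
    "SUGAR (Ens_m w/o grasp)": [93.1, 93.7, 98.8, 85.5, 92.3, 99.9, 97.3, 93.7, 68.8, 97.2],
    "SUGAR (Ens_m)": [95.8, 95.7, 96.1, 86.5, 94.2, 100.0, 97.0, 93.5, 72.0, 98.8],
}

# Reported averages (%) from the "Avg." column.
REPORTED_AVG = {
    "PolarNet": 89.8,
    "SUGAR (scratch)": 85.9,
    "SUGAR (SN_m)": 93.0,
    "SUGAR (Ens_m w/o grasp)": 92.0,
    "SUGAR (Ens_m)": 93.0,
}


def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


GAPS = {name: abs(mean(rates) - REPORTED_AVG[name])
        for name, rates in PER_TASK.items()}

for name, gap in GAPS.items():
    print(f"{name}: mean={mean(PER_TASK[name]):.2f}, "
          f"reported={REPORTED_AVG[name]}, gap={gap:.2f}")
```

Every recomputed mean lies within rounding error of the reported average, which supports reading the `Avg.` column as an unweighted mean over tasks.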

\begin{figure}
\centering
\includegraphics[width=0.9\linewidth]{figs/rlbench_10tasks_val.png}
\caption{Success rate on the RLBench validation split at different training iterations. We compare the policy trained from scratch and the model initialized from SUGAR pre-training.}
\end{figure}

\section{Real-world Robotic Manipulation}\label{sec:real_robot_expr}

\begin{figure}
\centering
\begin{subfigure}[b]{0.19\linewidth}
\includegraphics[width=\linewidth]{figs/real_task_stack_cups.png}
\caption{Stack cup.}
\end{subfigure}
\begin{subfigure}[b]{0.19\linewidth}
\includegraphics[width=\linewidth]{figs/real_task_put_fruit.png}
\caption{Put fruit in box.}
\end{subfigure}
\begin{subfigure}[b]{0.19\linewidth}
\includegraphics[width=\linewidth]{figs/real_task_open_drawer.png}
\caption{Open drawer.}
\end{subfigure}
\begin{subfigure}[b]{0.19\linewidth}
\includegraphics[width=\linewidth]{figs/real_task_put_cabinet.png}
\caption{Put item in cabinet.}
\end{subfigure}
\begin{subfigure}[b]{0.19\linewidth}
\includegraphics[width=\linewidth]{figs/real_task_hang_mug.png}
\caption{Hang mug.}
\end{subfigure}
\caption{Illustration of the five real-robot tasks.}
\end{figure}

To evaluate the effectiveness of SUGAR pre-training for real robots, we further perform real-world experiments on language-guided robotic manipulation. Specifically, we use a UR5 robotic arm equipped with an RG6 gripper and place two Intel RealSense D435 RGB-D cameras on the front and lateral sides of the robot's workspace. We adopt 5 real-world tasks: stack cup, put fruit in box, open drawer, put item in cabinet, and hang mug, as illustrated in Figure 2. For each task, we collect 20 real-robot demonstrations, where each demonstration consists of RGB-D images and proprioceptive information of the gripper at keysteps (typically 3-6 keysteps).

\begin{table}
\centering
\caption{Success rate of multi-task policies on 5 real-world tasks. We evaluate 10 episodes for each task.}
\begin{tabular}{lcc}
\toprule
 & No pre-train & SUGAR (Ens$_m$) \\
\midrule
Stack cup & 0/10 & 10/10 \\
Put fruit in box & 0/10 & 4/10 \\
Open drawer & 0/10 & 3/10 \\
Put item in cabinet & 0/10 & 9/10 \\
Hang mug & 0/10 & 6/10 \\
\bottomrule
\end{tabular}
\end{table}

We train a multi-task policy using the collected real-robot data and evaluate 10 episodes for each task, where the object locations and distractor objects differ from those in the training data. Table 2 presents results for a model trained from scratch on the real-robot data and a model initialized from SUGAR pre-training. The model trained from scratch overfits the limited training data and fails in every evaluation episode. As shown in Figure 2, the model trained from scratch has serious difficulty localizing the target object. Our SUGAR pre-training significantly improves the performance of language-guided manipulation in the real world, leading to an average success rate of 64\% over the five tasks. Figure 2 presents a successful case of putting a lemon in the box. However, we also notice that the model initialized from SUGAR pre-training still struggles with precise object localization, as shown in Figure 2. These problems may result from the sub-optimal network design, which heavily downsamples the point cloud; from the regression-based action prediction head, which is less stable than classification; and from the noisy depth sensors. We will further investigate the policy network design to improve robotic manipulation performance.
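The 64\% average quoted above follows directly from the per-task counts in the real-world results table, as the short sketch below shows. The counts are copied from the table; the variable names are illustrative.

```python
# Successful episodes out of 10 for the SUGAR-initialized policy.
SUCCESSES = {
    "Stack cup": 10,
    "Put fruit in box": 4,
    "Open drawer": 3,
    "Put item in cabinet": 9,
    "Hang mug": 6,
}
EPISODES_PER_TASK = 10

per_task_rate = {task: n / EPISODES_PER_TASK for task, n in SUCCESSES.items()}
average_rate = sum(per_task_rate.values()) / len(per_task_rate)
print(f"{average_rate:.0%}")
```

Since every task is evaluated on the same number of episodes, this unweighted mean over tasks equals the pooled rate of 32 successes in 50 episodes.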

\begin{figure}
\centering
\begin{subfigure}[b]{1\linewidth}
\includegraphics[width=\linewidth]{figs/real_scratch_put_fruit_failure.png}
\caption{A failure case of the multi-task policy trained from scratch.}
\end{subfigure}
\begin{subfigure}[b]{1\linewidth}
\includegraphics[width=\linewidth]{figs/real_sugar_put_fruit_success.png}
\caption{A successful case of the multi-task policy initialized from SUGAR pre-training.}
\end{subfigure}
\begin{subfigure}[b]{1\linewidth}
\includegraphics[width=\linewidth]{figs/real_sugar_put_fruit_failure.png}
\caption{A failure case of the multi-task policy initialized from SUGAR pre-training.}
\end{subfigure}
\caption{Examples of real-world execution on the Put fruit in box task for different policies.}
\end{figure}

\section{Limitations and Future Work}\label{sec:limitation}

This work adopts only a plain transformer architecture for point cloud encoding, which is computationally expensive. For example, compared to the SoTA method PolarNet~\cite{chen2023polarnet}, our model has 4.5$\times$ more parameters (65M vs.\ 14M) and trains 1.3$\times$ slower (18h vs.\ 14h of training on one V100 GPU). This is because PolarNet is based on a UNet backbone, which is more efficient. Our vanilla transformer-based backbone alone does not show a clear advantage over the UNet backbone for robotic manipulation, as seen in Table 1, although the proposed pre-training significantly boosts the performance. We believe that the proposed pre-training can benefit other architectures and plan to explore more efficient 3D backbones in future work.