Search | arXiv e-print repository

Sparse Diffusion Policy: A Sparse, Reusable, and Flexible Policy for Robot Learning

Authors: Yixiao Wang, Yifei Zhang, Mingxiao Huo, Ran Tian, Xiang Zhang, Yichen Xie, Chenfeng Xu, Pengliang Ji, Wei Zhan, Mingyu Ding, Masayoshi Tomizuka

Abstract: The increasing complexity of tasks in robotics demands efficient strategies for multitask and continual learning. Traditional models typically rely on a universal policy for all tasks, facing challenges such as high computational costs and catastrophic forgetting when learning new tasks. To address these issues, we introduce a sparse, reusable, and flexible policy, Sparse Diffusion Policy (SDP). B… ▽ More The increasing complexity of tasks in robotics demands efficient strategies for multitask and continual learning. Traditional models typically rely on a universal policy for all tasks, facing challenges such as high computational costs and catastrophic forgetting when learning new tasks. To address these issues, we introduce a sparse, reusable, and flexible policy, Sparse Diffusion Policy (SDP). By adopting Mixture of Experts (MoE) within a transformer-based diffusion policy, SDP selectively activates experts and skills, enabling efficient and task-specific learning without retraining the entire model. SDP not only reduces the burden of active parameters but also facilitates the seamless integration and reuse of experts across various tasks. Extensive experiments on diverse tasks in both simulations and real world show that SDP 1) excels in multitask scenarios with negligible increases in active parameters, 2) prevents forgetting in continual learning of new tasks, and 3) enables efficient task transfer, offering a promising solution for advanced robotic applications. Demos and codes can be found in https://forrest-110.github.io/sparse_diffusion_policy/. △ Less

Submitted 1 July, 2024; originally announced July 2024.

arXiv:2407.00898 [pdf, other]

Residual-MPPI: Online Policy Customization for Continuous Control

Authors: Pengcheng Wang, Chenran Li, Catherine Weaver, Kenta Kawamoto, Masayoshi Tomizuka, Chen Tang, Wei Zhan

Abstract: Policies learned through Reinforcement Learning (RL) and Imitation Learning (IL) have demonstrated significant potential in achieving advanced performance in continuous control tasks. However, in real-world environments, it is often necessary to further customize a trained policy when there are additional requirements that were unforeseen during the original training phase. It is possible to fine-… ▽ More Policies learned through Reinforcement Learning (RL) and Imitation Learning (IL) have demonstrated significant potential in achieving advanced performance in continuous control tasks. However, in real-world environments, it is often necessary to further customize a trained policy when there are additional requirements that were unforeseen during the original training phase. It is possible to fine-tune the policy to meet the new requirements, but this often requires collecting new data with the added requirements and access to the original training metric and policy parameters. In contrast, an online planning algorithm, if capable of meeting the additional requirements, can eliminate the necessity for extensive training phases and customize the policy without knowledge of the original training scheme or task. In this work, we propose a generic online planning algorithm for customizing continuous-control policies at the execution time which we call Residual-MPPI. It is able to customize a given prior policy on new performance metrics in few-shot and even zero-shot online settings. Also, Residual-MPPI only requires access to the action distribution produced by the prior policy, without additional knowledge regarding the original task. Through our experiments, we demonstrate that the proposed Residual-MPPI algorithm can accomplish the few-shot/zero-shot online policy customization task effectively, including customizing the champion-level racing agent, Gran Turismo Sophy (GT Sophy) 1.0, in the challenging car racing scenario, Gran Turismo Sport (GTS) environment. Demo videos are available on our website: https://sites.google.com/view/residual-mppi △ Less

Submitted 3 July, 2024; v1 submitted 30 June, 2024; originally announced July 2024.

arXiv:2406.16258 [pdf, other]

MEReQ: Max-Ent Residual-Q Inverse RL for Sample-Efficient Alignment from Intervention

Authors: Yuxin Chen, Chen Tang, Chenran Li, Ran Tian, Peter Stone, Masayoshi Tomizuka, Wei Zhan

Abstract: Aligning robot behavior with human preferences is crucial for deploying embodied AI agents in human-centered environments. A promising solution is interactive imitation learning from human intervention, where a human expert observes the policy's execution and provides interventions as feedback. However, existing methods often fail to utilize the prior policy efficiently to facilitate learning, thu… ▽ More Aligning robot behavior with human preferences is crucial for deploying embodied AI agents in human-centered environments. A promising solution is interactive imitation learning from human intervention, where a human expert observes the policy's execution and provides interventions as feedback. However, existing methods often fail to utilize the prior policy efficiently to facilitate learning, thus hindering sample efficiency. In this work, we introduce MEReQ (Maximum-Entropy Residual-Q Inverse Reinforcement Learning), designed for sample-efficient alignment from human intervention. Instead of inferring the complete human behavior characteristics, MEReQ infers a residual reward function that captures the discrepancy between the human expert's and the prior policy's underlying reward functions. It then employs Residual Q-Learning (RQL) to align the policy with human preferences using this residual reward function. Extensive evaluations on simulated and real-world tasks demonstrate that MEReQ achieves sample-efficient policy alignment from human intervention. △ Less

Submitted 23 June, 2024; originally announced June 2024.

ACM Class: I.2.6; I.2.9

arXiv:2406.12303 [pdf, other]

Immiscible Diffusion: Accelerating Diffusion Training with Noise Assignment

Authors: Yiheng Li, Heyang Jiang, Akio Kodaira, Masayoshi Tomizuka, Kurt Keutzer, Chenfeng Xu

Abstract: In this paper, we point out suboptimal noise-data map** leads to slow training of diffusion models. During diffusion training, current methods diffuse each image across the entire noise space, resulting in a mixture of all images at every point in the noise layer. We emphasize that this random mixture of noise-data map** complicates the optimization of the denoising function in diffusion model… ▽ More In this paper, we point out suboptimal noise-data map** leads to slow training of diffusion models. During diffusion training, current methods diffuse each image across the entire noise space, resulting in a mixture of all images at every point in the noise layer. We emphasize that this random mixture of noise-data map** complicates the optimization of the denoising function in diffusion models. Drawing inspiration from the immiscible phenomenon in physics, we propose Immiscible Diffusion, a simple and effective method to improve the random mixture of noise-data map**. In physics, miscibility can vary according to various intermolecular forces. Thus, immiscibility means that the mixing of the molecular sources is distinguishable. Inspired by this, we propose an assignment-then-diffusion training strategy. Specifically, prior to diffusing the image data into noise, we assign diffusion target noise for the image data by minimizing the total image-noise pair distance in a mini-batch. The assignment functions analogously to external forces to separate the diffuse-able areas of images, thus mitigating the inherent difficulties in diffusion training. Our approach is remarkably simple, requiring only one line of code to restrict the diffuse-able area for each image while preserving the Gaussian distribution of noise. This ensures that each image is projected only to nearby noise. To address the high complexity of the assignment algorithm, we employ a quantized-assignment method to reduce the computational overhead to a negligible level. Experiments demonstrate that our method achieve up to 3x faster training for consistency models and DDIM on the CIFAR dataset, and up to 1.3x faster on CelebA datasets for consistency models. Besides, we conduct thorough analysis about the Immiscible Diffusion, which sheds lights on how it improves diffusion training speed while improving the fidelity. △ Less

Submitted 18 June, 2024; originally announced June 2024.

arXiv:2406.04666 [pdf, other]

Underactuated Control of Multiple Soft Pneumatic Actuators via Stable Inversion

Authors: Wu-Te Yang, Burak Kurkcu, Masayoshi Tomizuka

Abstract: Soft grippers, with their inherent compliance and adaptability, show advantages for delicate and versatile manipulation tasks in robotics. This paper presents a novel approach to underactuated control of multiple soft actuators, specifically focusing on the synchronization of soft fingers within a soft gripper. Utilizing a single syringe pump as the actuation mechanism, we address the challenge of… ▽ More Soft grippers, with their inherent compliance and adaptability, show advantages for delicate and versatile manipulation tasks in robotics. This paper presents a novel approach to underactuated control of multiple soft actuators, specifically focusing on the synchronization of soft fingers within a soft gripper. Utilizing a single syringe pump as the actuation mechanism, we address the challenge of coordinating multiple degrees of freedom of a compliant system. The theoretical framework applies concepts from stable inversion theory, adapting them to the unique dynamics of the underactuated soft gripper. Through meticulous mechatronic system design and controller synthesis, we demonstrate both in simulation and experimentation the efficacy and applicability of our approach in achieving precise and synchronized manipulation tasks. Our findings not only contribute to the advancement of soft robot control but also offer practical insights into the design and control of underactuated systems for real-world applications. △ Less

Submitted 7 June, 2024; originally announced June 2024.

Comments: 10 pages, 7 figures

arXiv:2406.04534 [pdf, other]

Strategically Conservative Q-Learning

Authors: Yutaka Shimizu, Joey Hong, Sergey Levine, Masayoshi Tomizuka

Abstract: Offline reinforcement learning (RL) is a compelling paradigm to extend RL's practical utility by leveraging pre-collected, static datasets, thereby avoiding the limitations associated with collecting online interactions. The major difficulty in offline RL is mitigating the impact of approximation errors when encountering out-of-distribution (OOD) actions; doing so ineffectively will lead to polici… ▽ More Offline reinforcement learning (RL) is a compelling paradigm to extend RL's practical utility by leveraging pre-collected, static datasets, thereby avoiding the limitations associated with collecting online interactions. The major difficulty in offline RL is mitigating the impact of approximation errors when encountering out-of-distribution (OOD) actions; doing so ineffectively will lead to policies that prefer OOD actions, which can lead to unexpected and potentially catastrophic results. Despite the variety of works proposed to address this issue, they tend to excessively suppress the value function in and around OOD regions, resulting in overly pessimistic value estimates. In this paper, we propose a novel framework called Strategically Conservative Q-Learning (SCQ) that distinguishes between OOD data that is easy and hard to estimate, ultimately resulting in less conservative value estimates. Our approach exploits the inherent strengths of neural networks to interpolate, while carefully navigating their limitations in extrapolation, to obtain pessimistic yet still property calibrated value estimates. Theoretical analysis also shows that the value function learned by SCQ is still conservative, but potentially much less so than that of Conservative Q-learning (CQL). Finally, extensive evaluation on the D4RL benchmark tasks shows our proposed method outperforms state-of-the-art methods. Our code is available through \url{https://github.com/purewater0901/SCQ}. △ Less

Submitted 6 June, 2024; originally announced June 2024.

arXiv:2405.20323 [pdf, other]

$\textit{S}^3$Gaussian: Self-Supervised Street Gaussians for Autonomous Driving

Authors: Nan Huang, Xiaobao Wei, Wenzhao Zheng, Pengju An, Ming Lu, Wei Zhan, Masayoshi Tomizuka, Kurt Keutzer, Shanghang Zhang

Abstract: Photorealistic 3D reconstruction of street scenes is a critical technique for develo** real-world simulators for autonomous driving. Despite the efficacy of Neural Radiance Fields (NeRF) for driving scenes, 3D Gaussian Splatting (3DGS) emerges as a promising direction due to its faster speed and more explicit representation. However, most existing street 3DGS methods require tracked 3D vehicle b… ▽ More Photorealistic 3D reconstruction of street scenes is a critical technique for develo** real-world simulators for autonomous driving. Despite the efficacy of Neural Radiance Fields (NeRF) for driving scenes, 3D Gaussian Splatting (3DGS) emerges as a promising direction due to its faster speed and more explicit representation. However, most existing street 3DGS methods require tracked 3D vehicle bounding boxes to decompose the static and dynamic elements for effective reconstruction, limiting their applications for in-the-wild scenarios. To facilitate efficient 3D scene reconstruction without costly annotations, we propose a self-supervised street Gaussian ($\textit{S}^3$Gaussian) method to decompose dynamic and static elements from 4D consistency. We represent each scene with 3D Gaussians to preserve the explicitness and further accompany them with a spatial-temporal field network to compactly model the 4D dynamics. We conduct extensive experiments on the challenging Waymo-Open dataset to evaluate the effectiveness of our method. Our $\textit{S}^3$Gaussian demonstrates the ability to decompose static and dynamic scenes and achieves the best performance without using 3D annotations. Code is available at: https://github.com/nnanhuang/S3Gaussian/. △ Less

Submitted 30 May, 2024; originally announced May 2024.

Comments: Code is available at: https://github.com/nnanhuang/S3Gaussian/

arXiv:2405.15757 [pdf, other]

Looking Backward: Streaming Video-to-Video Translation with Feature Banks

Authors: Feng Liang, Akio Kodaira, Chenfeng Xu, Masayoshi Tomizuka, Kurt Keutzer, Diana Marculescu

Abstract: This paper introduces StreamV2V, a diffusion model that achieves real-time streaming video-to-video (V2V) translation with user prompts. Unlike prior V2V methods using batches to process limited frames, we opt to process frames in a streaming fashion, to support unlimited frames. At the heart of StreamV2V lies a backward-looking principle that relates the present to the past. This is realized by m… ▽ More This paper introduces StreamV2V, a diffusion model that achieves real-time streaming video-to-video (V2V) translation with user prompts. Unlike prior V2V methods using batches to process limited frames, we opt to process frames in a streaming fashion, to support unlimited frames. At the heart of StreamV2V lies a backward-looking principle that relates the present to the past. This is realized by maintaining a feature bank, which archives information from past frames. For incoming frames, StreamV2V extends self-attention to include banked keys and values and directly fuses similar past features into the output. The feature bank is continually updated by merging stored and new features, making it compact but informative. StreamV2V stands out for its adaptability and efficiency, seamlessly integrating with image diffusion models without fine-tuning. It can run 20 FPS on one A100 GPU, being 15x, 46x, 108x, and 158x faster than FlowVid, CoDeF, Rerender, and TokenFlow, respectively. Quantitative metrics and user studies confirm StreamV2V's exceptional ability to maintain temporal consistency. △ Less

Submitted 24 May, 2024; originally announced May 2024.

Comments: Project page: https://jeff-liangf.github.io/projects/streamv2v

arXiv:2405.01333 [pdf, other]

NeRF in Robotics: A Survey

Authors: Guangming Wang, Lei Pan, Songyou Peng, Shaohui Liu, Chenfeng Xu, Yanzi Miao, Wei Zhan, Masayoshi Tomizuka, Marc Pollefeys, Hesheng Wang

Abstract: Meticulous 3D environment representations have been a longstanding goal in computer vision and robotics fields. The recent emergence of neural implicit representations has introduced radical innovation to this field as implicit representations enable numerous capabilities. Among these, the Neural Radiance Field (NeRF) has sparked a trend because of the huge representational advantages, such as sim… ▽ More Meticulous 3D environment representations have been a longstanding goal in computer vision and robotics fields. The recent emergence of neural implicit representations has introduced radical innovation to this field as implicit representations enable numerous capabilities. Among these, the Neural Radiance Field (NeRF) has sparked a trend because of the huge representational advantages, such as simplified mathematical models, compact environment storage, and continuous scene representations. Apart from computer vision, NeRF has also shown tremendous potential in the field of robotics. Thus, we create this survey to provide a comprehensive understanding of NeRF in the field of robotics. By exploring the advantages and limitations of NeRF, as well as its current applications and future potential, we hope to shed light on this promising area of research. Our survey is divided into two main sections: \textit{The Application of NeRF in Robotics} and \textit{The Advance of NeRF in Robotics}, from the perspective of how NeRF enters the field of robotics. In the first section, we introduce and analyze some works that have been or could be used in the field of robotics from the perception and interaction perspectives. In the second section, we show some works related to improving NeRF's own properties, which are essential for deploying NeRF in the field of robotics. In the discussion section of the review, we summarize the existing challenges and provide some valuable future research directions for reference. △ Less

Submitted 2 May, 2024; originally announced May 2024.

Comments: 21 pages, 19 figures

arXiv:2404.06605 [pdf, other]

RoadBEV: Road Surface Reconstruction in Bird's Eye View

Authors: Tong Zhao, Lei Yang, Yichen Xie, Mingyu Ding, Masayoshi Tomizuka, Yintao Wei

Abstract: Road surface conditions, especially geometry profiles, enormously affect driving performance of autonomous vehicles. Vision-based online road reconstruction promisingly captures road information in advance. Existing solutions like monocular depth estimation and stereo matching suffer from modest performance. The recent technique of Bird's-Eye-View (BEV) perception provides immense potential to mor… ▽ More Road surface conditions, especially geometry profiles, enormously affect driving performance of autonomous vehicles. Vision-based online road reconstruction promisingly captures road information in advance. Existing solutions like monocular depth estimation and stereo matching suffer from modest performance. The recent technique of Bird's-Eye-View (BEV) perception provides immense potential to more reliable and accurate reconstruction. This paper uniformly proposes two simple yet effective models for road elevation reconstruction in BEV named RoadBEV-mono and RoadBEV-stereo, which estimate road elevation with monocular and stereo images, respectively. The former directly fits elevation values based on voxel features queried from image view, while the latter efficiently recognizes road elevation patterns based on BEV volume representing discrepancy between left and right voxel features. Insightful analyses reveal their consistence and difference with perspective view. Experiments on real-world dataset verify the models' effectiveness and superiority. Elevation errors of RoadBEV-mono and RoadBEV-stereo achieve 1.83cm and 0.50cm, respectively. The estimation performance improves by 50\% in BEV based on monocular image. Our models are promising for practical applications, providing valuable references for vision-based BEV perception in autonomous driving. The code is released at https://github.com/ztsrxh/RoadBEV. △ Less

Submitted 20 April, 2024; v1 submitted 9 April, 2024; originally announced April 2024.

Comments: Dataset page: https://thu-rsxd.com/rsrd Code: https://github.com/ztsrxh/RoadBEV

arXiv:2404.04772 [pdf, other]

Efficient Reinforcement Learning of Task Planners for Robotic Palletization through Iterative Action Masking Learning

Authors: Zheng Wu, Yichuan Li, Wei Zhan, Changliu Liu, Yun-Hui Liu, Masayoshi Tomizuka

Abstract: The development of robotic systems for palletization in logistics scenarios is of paramount importance, addressing critical efficiency and precision demands in supply chain management. This paper investigates the application of Reinforcement Learning (RL) in enhancing task planning for such robotic systems. Confronted with the substantial challenge of a vast action space, which is a significant im… ▽ More The development of robotic systems for palletization in logistics scenarios is of paramount importance, addressing critical efficiency and precision demands in supply chain management. This paper investigates the application of Reinforcement Learning (RL) in enhancing task planning for such robotic systems. Confronted with the substantial challenge of a vast action space, which is a significant impediment to efficiently apply out-of-the-shelf RL methods, our study introduces a novel method of utilizing supervised learning to iteratively prune and manage the action space effectively. By reducing the complexity of the action space, our approach not only accelerates the learning phase but also ensures the effectiveness and reliability of the task planning in robotic palletization. The experimental results underscore the efficacy of this method, highlighting its potential in improving the performance of RL applications in complex and high-dimensional environments like logistics palletization. △ Less

Submitted 6 April, 2024; originally announced April 2024.

Comments: 8 pages, 8 figures

arXiv:2404.00237 [pdf, other]

Joint Pedestrian Trajectory Prediction through Posterior Sampling

Authors: Haotian Lin, Yixiao Wang, Mingxiao Huo, Chensheng Peng, Zhiyuan Liu, Masayoshi Tomizuka

Abstract: Joint pedestrian trajectory prediction has long grappled with the inherent unpredictability of human behaviors. Recent investigations employing variants of conditional diffusion models in trajectory prediction have exhibited notable success. Nevertheless, the heavy dependence on accurate historical data results in their vulnerability to noise disturbances and data incompleteness. To improve the ro… ▽ More Joint pedestrian trajectory prediction has long grappled with the inherent unpredictability of human behaviors. Recent investigations employing variants of conditional diffusion models in trajectory prediction have exhibited notable success. Nevertheless, the heavy dependence on accurate historical data results in their vulnerability to noise disturbances and data incompleteness. To improve the robustness and reliability, we introduce the Guided Full Trajectory Diffuser (GFTD), a novel diffusion model framework that captures the joint full (historical and future) trajectory distribution. By learning from the full trajectory, GFTD can recover the noisy and missing data, hence improving the robustness. In addition, GFTD can adapt to data imperfections without additional training requirements, leveraging posterior sampling for reliable prediction and controllable generation. Our approach not only simplifies the prediction process but also enhances generalizability in scenarios with noise and incomplete inputs. Through rigorous experimental evaluation, GFTD exhibits superior performance in both trajectory prediction and controllable generation. △ Less

Submitted 30 March, 2024; originally announced April 2024.

arXiv:2403.20001 [pdf, other]

Adaptive Energy Regularization for Autonomous Gait Transition and Energy-Efficient Quadruped Locomotion

Authors: Boyuan Liang, Lingfeng Sun, Xinghao Zhu, Bike Zhang, Ziyin Xiong, Chenran Li, Koushil Sreenath, Masayoshi Tomizuka

Abstract: In reinforcement learning for legged robot locomotion, crafting effective reward strategies is crucial. Pre-defined gait patterns and complex reward systems are widely used to stabilize policy training. Drawing from the natural locomotion behaviors of humans and animals, which adapt their gaits to minimize energy consumption, we propose a simplified, energy-centric reward strategy to foster the de… ▽ More In reinforcement learning for legged robot locomotion, crafting effective reward strategies is crucial. Pre-defined gait patterns and complex reward systems are widely used to stabilize policy training. Drawing from the natural locomotion behaviors of humans and animals, which adapt their gaits to minimize energy consumption, we propose a simplified, energy-centric reward strategy to foster the development of energy-efficient locomotion across various speeds in quadruped robots. By implementing an adaptive energy reward function and adjusting the weights based on velocity, we demonstrate that our approach enables ANYmal-C and Unitree Go1 robots to autonomously select appropriate gaits, such as four-beat walking at lower speeds and trotting at higher speeds, resulting in improved energy efficiency and stable velocity tracking compared to previous methods using complex reward designs and prior gait knowledge. The effectiveness of our policy is validated through simulations in the IsaacGym simulation environment and on real robots, demonstrating its potential to facilitate stable and adaptive locomotion. △ Less

Submitted 29 March, 2024; originally announced March 2024.

Comments: 8 pages, 5 figures

arXiv:2403.18960 [pdf, other]

Robust In-Hand Manipulation with Extrinsic Contacts

Authors: Boyuan Liang, Kei Ota, Masayoshi Tomizuka, Devesh Jha

Abstract: We present in-hand manipulation tasks where a robot moves an object in grasp, maintains its external contact mode with the environment, and adjusts its in-hand pose simultaneously. The proposed manipulation task leads to complex contact interactions which can be very susceptible to uncertainties in kinematic and physical parameters. Therefore, we propose a robust in-hand manipulation method, which… ▽ More We present in-hand manipulation tasks where a robot moves an object in grasp, maintains its external contact mode with the environment, and adjusts its in-hand pose simultaneously. The proposed manipulation task leads to complex contact interactions which can be very susceptible to uncertainties in kinematic and physical parameters. Therefore, we propose a robust in-hand manipulation method, which consists of two parts. First, an in-gripper mechanics model that computes a naïve motion cone assuming all parameters are precise. Then, a robust planning method refines the motion cone to maintain desired contact mode regardless of parametric errors. Real-world experiments were conducted to illustrate the accuracy of the mechanics model and the effectiveness of the robust planning framework in the presence of kinematics parameter errors. △ Less

Submitted 27 March, 2024; originally announced March 2024.

Comments: Accepted at ICRA 24

arXiv:2403.16786 [pdf, other]

doi 10.1109/LRA.2024.3387145

DBPF: A Framework for Efficient and Robust Dynamic Bin-Picking

Authors: Yichuan Li, Junkai Zhao, Yixiao Li, Zheng Wu, Rui Cao, Masayoshi Tomizuka, Yunhui Liu

Abstract: Efficiency and reliability are critical in robotic bin-picking as they directly impact the productivity of automated industrial processes. However, traditional approaches, demanding static objects and fixed collisions, lead to deployment limitations, operational inefficiencies, and process unreliability. This paper introduces a Dynamic Bin-Picking Framework (DBPF) that challenges traditional stati… ▽ More Efficiency and reliability are critical in robotic bin-picking as they directly impact the productivity of automated industrial processes. However, traditional approaches, demanding static objects and fixed collisions, lead to deployment limitations, operational inefficiencies, and process unreliability. This paper introduces a Dynamic Bin-Picking Framework (DBPF) that challenges traditional static assumptions. The DBPF endows the robot with the reactivity to pick multiple moving arbitrary objects while avoiding dynamic obstacles, such as the moving bin. Combined with scene-level pose generation, the proposed pose selection metric leverages the Tendency-Aware Manipulability Network optimizing suction pose determination. Heuristic task-specific designs like velocity-matching, dynamic obstacle avoidance, and the resight policy, enhance the picking success rate and reliability. Empirical experiments demonstrate the importance of these components. Our method achieves an average 84% success rate, surpassing the 60% of the most comparable baseline, crucially, with zero collisions. Further evaluations under diverse dynamic scenarios showcase DBPF's robust performance in dynamic bin-picking. Results suggest that our framework offers a promising solution for efficient and reliable robotic bin-picking under dynamics. △ Less

Submitted 25 March, 2024; originally announced March 2024.

Comments: 8 pages, 5 figures. This paper has been accepted by IEEE RA-L on 2024-03-24. See the supplementary video at youtube: https://youtu.be/n5af2VsKhkg

arXiv:2403.15712 [pdf, other]

doi 10.1109/LRA.2024.3379865

PNAS-MOT: Multi-Modal Object Tracking with Pareto Neural Architecture Search

Authors: Chensheng Peng, Zhaoyu Zeng, **ling Gao, Jundong Zhou, Masayoshi Tomizuka, Xinbing Wang, Chenghu Zhou, Nanyang Ye

Abstract: Multiple object tracking is a critical task in autonomous driving. Existing works primarily focus on the heuristic design of neural networks to obtain high accuracy. As tracking accuracy improves, however, neural networks become increasingly complex, posing challenges for their practical application in real driving scenarios due to the high level of latency. In this paper, we explore the use of th… ▽ More Multiple object tracking is a critical task in autonomous driving. Existing works primarily focus on the heuristic design of neural networks to obtain high accuracy. As tracking accuracy improves, however, neural networks become increasingly complex, posing challenges for their practical application in real driving scenarios due to the high level of latency. In this paper, we explore the use of the neural architecture search (NAS) methods to search for efficient architectures for tracking, aiming for low real-time latency while maintaining relatively high accuracy. Another challenge for object tracking is the unreliability of a single sensor, therefore, we propose a multi-modal framework to improve the robustness. Experiments demonstrate that our algorithm can run on edge devices within lower latency constraints, thus greatly reducing the computational requirements for multi-modal object tracking while kee** lower latency. △ Less

Submitted 23 March, 2024; originally announced March 2024.

Comments: IEEE Robotics and Automation Letters 2024. Code is available at https://github.com/PholyPeng/PNAS-MOT

Journal ref: IEEE Robotics and Automation Letters, 2024

arXiv:2403.12676 [pdf, other]

In-Hand Following of Deformable Linear Objects Using Dexterous Fingers with Tactile Sensing

Authors: Mingrui Yu, Boyuan Liang, Xiang Zhang, Xinghao Zhu, Xiang Li, Masayoshi Tomizuka

Abstract: Most research on deformable linear object (DLO) manipulation assumes rigid gras**. However, beyond rigid gras** and re-gras**, in-hand following is also an essential skill that humans use to dexterously manipulate DLOs, which requires continuously changing the grasp point by in-hand sliding while holding the DLO to prevent it from falling. Achieving such a skill is very challenging for robot… ▽ More Most research on deformable linear object (DLO) manipulation assumes rigid gras**. However, beyond rigid gras** and re-gras**, in-hand following is also an essential skill that humans use to dexterously manipulate DLOs, which requires continuously changing the grasp point by in-hand sliding while holding the DLO to prevent it from falling. Achieving such a skill is very challenging for robots without using specially designed but not versatile end-effectors. Previous works have attempted using generic parallel grippers, but their robustness is unsatisfactory owing to the conflict between following and holding, which is hard to balance with a one-degree-of-freedom gripper. In this work, inspired by how humans use fingers to follow DLOs, we explore the usage of a generic dexterous hand with tactile sensing to imitate human skills and achieve robust in-hand DLO following. To enable the hardware system to function in the real world, we develop a framework that includes Cartesian-space arm-hand control, tactile-based in-hand 3-D DLO pose estimation, and task-specific motion design. Experimental results demonstrate the significant superiority of our method over using parallel grippers, as well as its great robustness, generalizability, and efficiency. △ Less

Submitted 19 March, 2024; originally announced March 2024.

arXiv:2403.08125 [pdf, other]

Q-SLAM: Quadric Representations for Monocular SLAM

Authors: Chensheng Peng, Chenfeng Xu, Yue Wang, Mingyu Ding, Heng Yang, Masayoshi Tomizuka, Kurt Keutzer, Marco Pavone, Wei Zhan

Abstract: Monocular SLAM has long grappled with the challenge of accurately modeling 3D geometries. Recent advances in Neural Radiance Fields (NeRF)-based monocular SLAM have shown promise, yet these methods typically focus on novel view synthesis rather than precise 3D geometry modeling. This focus results in a significant disconnect between NeRF applications, i.e., novel-view synthesis and the requirement… ▽ More Monocular SLAM has long grappled with the challenge of accurately modeling 3D geometries. Recent advances in Neural Radiance Fields (NeRF)-based monocular SLAM have shown promise, yet these methods typically focus on novel view synthesis rather than precise 3D geometry modeling. This focus results in a significant disconnect between NeRF applications, i.e., novel-view synthesis and the requirements of SLAM. We identify that the gap results from the volumetric representations used in NeRF, which are often dense and noisy. In this study, we propose a novel approach that reimagines volumetric representations through the lens of quadric forms. We posit that most scene components can be effectively represented as quadric planes. Leveraging this assumption, we reshape the volumetric representations with million of cubes by several quadric planes, which leads to more accurate and efficient modeling of 3D scenes in SLAM contexts. Our method involves two key steps: First, we use the quadric assumption to enhance coarse depth estimations obtained from tracking modules, e.g., Droid-SLAM. This step alone significantly improves depth estimation accuracy. Second, in the subsequent map** phase, we diverge from previous NeRF-based SLAM methods that distribute sampling points across the entire volume space. Instead, we concentrate sampling points around quadric planes and aggregate them using a novel quadric-decomposed Transformer. Additionally, we introduce an end-to-end joint optimization strategy that synchronizes pose estimation with 3D reconstruction. △ Less

Submitted 12 March, 2024; originally announced March 2024.

arXiv:2403.07470 [pdf, other]

DrPlanner: Diagnosis and Repair of Motion Planners Using Large Language Models

Authors: Yuanfei Lin, Chenran Li, Mingyu Ding, Masayoshi Tomizuka, Wei Zhan, Matthias Althoff

Abstract: Motion planners are essential for the safe operation of automated vehicles across various scenarios. However, no motion planning algorithm has achieved perfection in the literature, and improving its performance is often time-consuming and labor-intensive. To tackle the aforementioned issues, we present DrPlanner, the first framework designed to automatically diagnose and repair motion planners us… ▽ More Motion planners are essential for the safe operation of automated vehicles across various scenarios. However, no motion planning algorithm has achieved perfection in the literature, and improving its performance is often time-consuming and labor-intensive. To tackle the aforementioned issues, we present DrPlanner, the first framework designed to automatically diagnose and repair motion planners using large language models. Initially, we generate a structured description of the planner and its planned trajectories from both natural and programming languages. Leveraging the profound capabilities of large language models in addressing reasoning challenges, our framework returns repaired planners with detailed diagnostic descriptions. Furthermore, the framework advances iteratively with continuous feedback from the evaluation of the repaired outcomes. Our approach is validated using search-based motion planners; experimental results highlight the need of demonstrations in the prompt and the ability of our framework in identifying and rectifying elusive issues effectively. △ Less

Submitted 12 March, 2024; originally announced March 2024.

Comments: @2024 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works

arXiv:2403.06086 [pdf, other]

Towards Generalizable and Interpretable Motion Prediction: A Deep Variational Bayes Approach

Authors: Juanwu Lu, Wei Zhan, Masayoshi Tomizuka, Ye** Hu

Abstract: Estimating the potential behavior of the surrounding human-driven vehicles is crucial for the safety of autonomous vehicles in a mixed traffic flow. Recent state-of-the-art achieved accurate prediction using deep neural networks. However, these end-to-end models are usually black boxes with weak interpretability and generalizability. This paper proposes the Goal-based Neural Variational Agent (GNe… ▽ More Estimating the potential behavior of the surrounding human-driven vehicles is crucial for the safety of autonomous vehicles in a mixed traffic flow. Recent state-of-the-art achieved accurate prediction using deep neural networks. However, these end-to-end models are usually black boxes with weak interpretability and generalizability. This paper proposes the Goal-based Neural Variational Agent (GNeVA), an interpretable generative model for motion prediction with robust generalizability to out-of-distribution cases. For interpretability, the model achieves target-driven motion prediction by estimating the spatial distribution of long-term destinations with a variational mixture of Gaussians. We identify a causal structure among maps and agents' histories and derive a variational posterior to enhance generalizability. Experiments on motion prediction datasets validate that the fitted model can be interpretable and generalizable and can achieve comparable performance to state-of-the-art results. △ Less

Submitted 9 March, 2024; originally announced March 2024.

Comments: Accepted at AISTATS 2024

arXiv:2403.06041 [pdf, other]

MATRIX: Multi-Agent Trajectory Generation with Diverse Contexts

Authors: Zhuo Xu, Rui Zhou, Yida Yin, Huidong Gao, Masayoshi Tomizuka, Jiachen Li

Abstract: Data-driven methods have great advantages in modeling complicated human behavioral dynamics and dealing with many human-robot interaction applications. However, collecting massive and annotated real-world human datasets has been a laborious task, especially for highly interactive scenarios. On the other hand, algorithmic data generation methods are usually limited by their model capacities, making… ▽ More Data-driven methods have great advantages in modeling complicated human behavioral dynamics and dealing with many human-robot interaction applications. However, collecting massive and annotated real-world human datasets has been a laborious task, especially for highly interactive scenarios. On the other hand, algorithmic data generation methods are usually limited by their model capacities, making them unable to offer realistic and diverse data needed by various application users. In this work, we study trajectory-level data generation for multi-human or human-robot interaction scenarios and propose a learning-based automatic trajectory generation model, which we call Multi-Agent TRajectory generation with dIverse conteXts (MATRIX). MATRIX is capable of generating interactive human behaviors in realistic diverse contexts. We achieve this goal by modeling the explicit and interpretable objectives so that MATRIX can generate human motions based on diverse destinations and heterogeneous behaviors. We carried out extensive comparison and ablation studies to illustrate the effectiveness of our approach across various metrics. We also presented experiments that demonstrate the capability of MATRIX to serve as data augmentation for imitation-based motion planning. △ Less

Submitted 9 March, 2024; originally announced March 2024.

Comments: IEEE International Conference on Robotics and Automation (ICRA 2024)

arXiv:2402.18897 [pdf, other]

Contact-Implicit Model Predictive Control for Dexterous In-hand Manipulation: A Long-Horizon and Robust Approach

Authors: Yongpeng Jiang, Mingrui Yu, Xinghao Zhu, Masayoshi Tomizuka, Xiang Li

Abstract: Dexterous in-hand manipulation is an essential skill of production and life. Nevertheless, the highly stiff and mutable features of contacts cause limitations to real-time contact discovery and inference, which degrades the performance of model-based methods. Inspired by recent advancements in contact-rich locomotion and manipulation, this paper proposes a novel model-based approach to control dex… ▽ More Dexterous in-hand manipulation is an essential skill of production and life. Nevertheless, the highly stiff and mutable features of contacts cause limitations to real-time contact discovery and inference, which degrades the performance of model-based methods. Inspired by recent advancements in contact-rich locomotion and manipulation, this paper proposes a novel model-based approach to control dexterous in-hand manipulation and overcome the current limitations. The proposed approach has the attractive feature, which allows the robot to robustly execute long-horizon in-hand manipulation without pre-defined contact sequences or separated planning procedures. Specifically, we design a contact-implicit model predictive controller at high-level to generate real-time contact plans, which are executed by the low-level tracking controller. Compared with other model-based methods, such a long-horizon feature enables replanning and robust execution of contact-rich motions to achieve large-displacement in-hand tasks more efficiently; Compared with existing learning-based methods, the proposed approach achieves the dexterity and also generalizes to different objects without any pre-training. Detailed simulations and ablation studies demonstrate the efficiency and effectiveness of our method. It runs at 20Hz on the 23-degree-of-freedom long-horizon in-hand object rotation task. △ Less

Submitted 11 March, 2024; v1 submitted 29 February, 2024; originally announced February 2024.

Comments: 7 pages, 8 figures, submitted to IROS2024

arXiv:2402.16836 [pdf, other]

PhyGrasp: Generalizing Robotic Gras** with Physics-informed Large Multimodal Models

Authors: Dingkun Guo, Yuqi Xiang, Shuqi Zhao, Xinghao Zhu, Masayoshi Tomizuka, Mingyu Ding, Wei Zhan

Abstract: Robotic gras** is a fundamental aspect of robot functionality, defining how robots interact with objects. Despite substantial progress, its generalizability to counter-intuitive or long-tailed scenarios, such as objects with uncommon materials or shapes, remains a challenge. In contrast, humans can easily apply their intuitive physics to grasp skillfully and change grasps efficiently, even for o… ▽ More Robotic gras** is a fundamental aspect of robot functionality, defining how robots interact with objects. Despite substantial progress, its generalizability to counter-intuitive or long-tailed scenarios, such as objects with uncommon materials or shapes, remains a challenge. In contrast, humans can easily apply their intuitive physics to grasp skillfully and change grasps efficiently, even for objects they have never seen before. This work delves into infusing such physical commonsense reasoning into robotic manipulation. We introduce PhyGrasp, a multimodal large model that leverages inputs from two modalities: natural language and 3D point clouds, seamlessly integrated through a bridge module. The language modality exhibits robust reasoning capabilities concerning the impacts of diverse physical properties on gras**, while the 3D modality comprehends object shapes and parts. With these two capabilities, PhyGrasp is able to accurately assess the physical properties of object parts and determine optimal gras** poses. Additionally, the model's language comprehension enables human instruction interpretation, generating gras** poses that align with human preferences. To train PhyGrasp, we construct a dataset PhyPartNet with 195K object instances with varying physical properties and human preferences, alongside their corresponding language descriptions. Extensive experiments conducted in the simulation and on the real robots demonstrate that PhyGrasp achieves state-of-the-art performance, particularly in long-tailed cases, e.g., about 10% improvement in success rate over GraspNet. Project page: https://sites.google.com/view/phygrasp △ Less

Submitted 26 February, 2024; originally announced February 2024.

arXiv:2402.15583 [pdf, other]

Cohere3D: Exploiting Temporal Coherence for Unsupervised Representation Learning of Vision-based Autonomous Driving

Authors: Yichen Xie, Hongge Chen, Gregory P. Meyer, Yong Jae Lee, Eric M. Wolff, Masayoshi Tomizuka, Wei Zhan, Yuning Chai, Xin Huang

Abstract: Due to the lack of depth cues in images, multi-frame inputs are important for the success of vision-based perception, prediction, and planning in autonomous driving. Observations from different angles enable the recovery of 3D object states from 2D image inputs if we can identify the same instance in different input frames. However, the dynamic nature of autonomous driving scenes leads to signific… ▽ More Due to the lack of depth cues in images, multi-frame inputs are important for the success of vision-based perception, prediction, and planning in autonomous driving. Observations from different angles enable the recovery of 3D object states from 2D image inputs if we can identify the same instance in different input frames. However, the dynamic nature of autonomous driving scenes leads to significant changes in the appearance and shape of each instance captured by the camera at different time steps. To this end, we propose a novel contrastive learning algorithm, Cohere3D, to learn coherent instance representations in a long-term input sequence robust to the change in distance and perspective. The learned representation aids in instance-level correspondence across multiple input frames in downstream tasks. In the pretraining stage, the raw point clouds from LiDAR sensors are utilized to construct the long-term temporal correspondence for each instance, which serves as guidance for the extraction of instance-level representation from the vision-based bird's eye-view (BEV) feature map. Cohere3D encourages a consistent representation for the same instance at different frames but distinguishes between representations of different instances. We evaluate our algorithm by finetuning the pretrained model on various downstream perception, prediction, and planning tasks. Results show a notable improvement in both data efficiency and task performance. △ Less

Submitted 23 February, 2024; originally announced February 2024.

arXiv:2402.14194 [pdf, other]

BeTAIL: Behavior Transformer Adversarial Imitation Learning from Human Racing Gameplay

Authors: Catherine Weaver, Chen Tang, Ce Hao, Kenta Kawamoto, Masayoshi Tomizuka, Wei Zhan

Abstract: Imitation learning learns a policy from demonstrations without requiring hand-designed reward functions. In many robotic tasks, such as autonomous racing, imitated policies must model complex environment dynamics and human decision-making. Sequence modeling is highly effective in capturing intricate patterns of motion sequences but struggles to adapt to new environments or distribution shifts that… ▽ More Imitation learning learns a policy from demonstrations without requiring hand-designed reward functions. In many robotic tasks, such as autonomous racing, imitated policies must model complex environment dynamics and human decision-making. Sequence modeling is highly effective in capturing intricate patterns of motion sequences but struggles to adapt to new environments or distribution shifts that are common in real-world robotics tasks. In contrast, Adversarial Imitation Learning (AIL) can mitigate this effect, but struggles with sample inefficiency and handling complex motion patterns. Thus, we propose BeTAIL: Behavior Transformer Adversarial Imitation Learning, which combines a Behavior Transformer (BeT) policy from human demonstrations with online AIL. BeTAIL adds an AIL residual policy to the BeT policy to model the sequential decision-making process of human experts and correct for out-of-distribution states or shifts in environment dynamics. We test BeTAIL on three challenges with expert-level demonstrations of real human gameplay in Gran Turismo Sport. Our proposed residual BeTAIL reduces environment interactions and improves racing performance and stability, even when the BeT is pretrained on different tracks than downstream learning. Videos and code available at: https://sites.google.com/berkeley.edu/BeTAIL/home. △ Less

Submitted 21 February, 2024; originally announced February 2024.

Comments: Preprint

arXiv:2402.08931 [pdf, other]

Depth-aware Volume Attention for Texture-less Stereo Matching

Authors: Tong Zhao, Mingyu Ding, Wei Zhan, Masayoshi Tomizuka, Yintao Wei

Abstract: Stereo matching plays a crucial role in 3D perception and scenario understanding. Despite the proliferation of promising methods, addressing texture-less and texture-repetitive conditions remains challenging due to the insufficient availability of rich geometric and semantic information. In this paper, we propose a lightweight volume refinement scheme to tackle the texture deterioration in practic… ▽ More Stereo matching plays a crucial role in 3D perception and scenario understanding. Despite the proliferation of promising methods, addressing texture-less and texture-repetitive conditions remains challenging due to the insufficient availability of rich geometric and semantic information. In this paper, we propose a lightweight volume refinement scheme to tackle the texture deterioration in practical outdoor scenarios. Specifically, we introduce a depth volume supervised by the ground-truth depth map, capturing the relative hierarchy of image texture. Subsequently, the disparity discrepancy volume undergoes hierarchical filtering through the incorporation of depth-aware hierarchy attention and target-aware disparity attention modules. Local fine structure and context are emphasized to mitigate ambiguity and redundancy during volume aggregation. Furthermore, we propose a more rigorous evaluation metric that considers depth-wise relative error, providing comprehensive evaluations for universal stereo matching and depth estimation models. We extensively validate the superiority of our proposed methods on public datasets. Results demonstrate that our model achieves state-of-the-art performance, particularly excelling in scenarios with texture-less images. The code is available at https://github.com/ztsrxh/DVANet. △ Less

Submitted 26 February, 2024; v1 submitted 13 February, 2024; originally announced February 2024.

Comments: 10 pages, 6 figures

arXiv:2401.15315 [pdf, other]

Learning Online Belief Prediction for Efficient POMDP Planning in Autonomous Driving

Authors: Zhiyu Huang, Chen Tang, Chen Lv, Masayoshi Tomizuka, Wei Zhan

Abstract: Effective decision-making in autonomous driving relies on accurate inference of other traffic agents' future behaviors. To achieve this, we propose an online belief-update-based behavior prediction model and an efficient planner for Partially Observable Markov Decision Processes (POMDPs). We develop a Transformer-based prediction model, enhanced with a recurrent neural memory model, to dynamically… ▽ More Effective decision-making in autonomous driving relies on accurate inference of other traffic agents' future behaviors. To achieve this, we propose an online belief-update-based behavior prediction model and an efficient planner for Partially Observable Markov Decision Processes (POMDPs). We develop a Transformer-based prediction model, enhanced with a recurrent neural memory model, to dynamically update latent belief state and infer the intentions of other agents. The model can also integrate the ego vehicle's intentions to reflect closed-loop interactions among agents, and it learns from both offline data and online interactions. For planning, we employ a Monte-Carlo Tree Search (MCTS) planner with macro actions, which reduces computational complexity by searching over temporally extended action steps. Inside the MCTS planner, we use predicted long-term multi-modal trajectories to approximate future updates, which eliminates iterative belief updating and improves the running efficiency. Our approach also incorporates deep Q-learning (DQN) as a search prior, which significantly improves the performance of the MCTS planner. Experimental results from simulated environments validate the effectiveness of our proposed method. The online belief update model can significantly enhance the accuracy and temporal consistency of predictions, leading to improved decision-making performance. Employing DQN as a search prior in the MCTS planner considerably boosts its performance and outperforms an imitation learning-based prior. Additionally, we show that the MCTS planning with macro actions substantially outperforms the vanilla method in terms of performance and efficiency. △ Less

Submitted 17 June, 2024; v1 submitted 27 January, 2024; originally announced January 2024.

Comments: IEEE Robotics and Automation Letters

arXiv:2401.00391 [pdf, other]

SAFE-SIM: Safety-Critical Closed-Loop Traffic Simulation with Controllable Adversaries

Authors: Wei-Jer Chang, Francesco Pittaluga, Masayoshi Tomizuka, Wei Zhan, Manmohan Chandraker

Abstract: Evaluating the performance of autonomous vehicle planning algorithms necessitates simulating long-tail safety-critical traffic scenarios. However, traditional methods for generating such scenarios often fall short in terms of controllability and realism and neglect the dynamics of agent interactions. To mitigate these limitations, we introduce SAFE-SIM, a novel diffusion-based controllable closed-… ▽ More Evaluating the performance of autonomous vehicle planning algorithms necessitates simulating long-tail safety-critical traffic scenarios. However, traditional methods for generating such scenarios often fall short in terms of controllability and realism and neglect the dynamics of agent interactions. To mitigate these limitations, we introduce SAFE-SIM, a novel diffusion-based controllable closed-loop safety-critical simulation framework. Our approach yields two distinct advantages: 1) the generation of realistic long-tail safety-critical scenarios that closely emulate real-world conditions, and 2) enhanced controllability, enabling more comprehensive and interactive evaluations. We develop a novel approach to simulate safety-critical scenarios through an adversarial term in the denoising process, which allows an adversarial agent to challenge a planner with plausible maneuvers while all agents in the scene exhibit reactive and realistic behaviors. Furthermore, we propose novel guidance objectives and a partial diffusion process that enables a user to control key aspects of the generated scenarios, such as the collision type and aggressiveness of the adversarial driver, while maintaining the realism of the behavior. We validate our framework empirically using the NuScenes dataset, demonstrating improvements in both realism and controllability. These findings affirm that diffusion models provide a robust and versatile foundation for safety-critical, interactive traffic simulation, extending their utility across the broader landscape of autonomous driving. For supplementary videos, visit our project at https://safe-sim.github.io/. △ Less

Submitted 15 June, 2024; v1 submitted 30 December, 2023; originally announced January 2024.

Comments: Under Review

ACM Class: I.2.9; I.2.6

arXiv:2312.11598 [pdf, other]

SkillDiffuser: Interpretable Hierarchical Planning via Skill Abstractions in Diffusion-Based Task Execution

Authors: Zhixuan Liang, Yao Mu, Hengbo Ma, Masayoshi Tomizuka, Mingyu Ding, ** Luo

Abstract: Diffusion models have demonstrated strong potential for robotic trajectory planning. However, generating coherent trajectories from high-level instructions remains challenging, especially for long-range composition tasks requiring multiple sequential skills. We propose SkillDiffuser, an end-to-end hierarchical planning framework integrating interpretable skill learning with conditional diffusion p… ▽ More Diffusion models have demonstrated strong potential for robotic trajectory planning. However, generating coherent trajectories from high-level instructions remains challenging, especially for long-range composition tasks requiring multiple sequential skills. We propose SkillDiffuser, an end-to-end hierarchical planning framework integrating interpretable skill learning with conditional diffusion planning to address this problem. At the higher level, the skill abstraction module learns discrete, human-understandable skill representations from visual observations and language instructions. These learned skill embeddings are then used to condition the diffusion model to generate customized latent trajectories aligned with the skills. This allows generating diverse state trajectories that adhere to the learnable skills. By integrating skill learning with conditional trajectory generation, SkillDiffuser produces coherent behavior following abstract instructions across diverse tasks. Experiments on multi-task robotic manipulation benchmarks like Meta-World and LOReL demonstrate state-of-the-art performance and human-interpretable skill representations from SkillDiffuser. More visualization results and information could be found on our website. △ Less

Submitted 28 March, 2024; v1 submitted 18 December, 2023; originally announced December 2023.

Comments: Accepted by CVPR 2024. Camera ready version. Project page: https://skilldiffuser.github.io/

arXiv:2312.10571 [pdf, other]

Multi-level Reasoning for Robotic Assembly: From Sequence Inference to Contact Selection

Authors: Xinghao Zhu, Devesh K. Jha, Diego Romeres, Lingfeng Sun, Masayoshi Tomizuka, Anoop Cherian

Abstract: Automating the assembly of objects from their parts is a complex problem with innumerable applications in manufacturing, maintenance, and recycling. Unlike existing research, which is limited to target segmentation, pose regression, or using fixed target blueprints, our work presents a holistic multi-level framework for part assembly planning consisting of part assembly sequence inference, part mo… ▽ More Automating the assembly of objects from their parts is a complex problem with innumerable applications in manufacturing, maintenance, and recycling. Unlike existing research, which is limited to target segmentation, pose regression, or using fixed target blueprints, our work presents a holistic multi-level framework for part assembly planning consisting of part assembly sequence inference, part motion planning, and robot contact optimization. We present the Part Assembly Sequence Transformer (PAST) -- a sequence-to-sequence neural network -- to infer assembly sequences recursively from a target blueprint. We then use a motion planner and optimization to generate part movements and contacts. To train PAST, we introduce D4PAS: a large-scale Dataset for Part Assembly Sequences (D4PAS) consisting of physically valid sequences for industrial objects. Experimental results show that our approach generalizes better than prior methods while needing significantly less computational time for inference. △ Less

Submitted 16 December, 2023; originally announced December 2023.

Comments: Supplementary video is available at https://www.youtube.com/watch?v=XNYkWSHkAaU&ab_channel=MitsubishiElectricResearchLabs%28MERL%29

arXiv:2312.06876 [pdf, other]

Interactive Planning Using Large Language Models for Partially Observable Robotics Tasks

Authors: Lingfeng Sun, Devesh K. Jha, Chiori Hori, Siddarth Jain, Radu Corcodel, Xinghao Zhu, Masayoshi Tomizuka, Diego Romeres

Abstract: Designing robotic agents to perform open vocabulary tasks has been the long-standing goal in robotics and AI. Recently, Large Language Models (LLMs) have achieved impressive results in creating robotic agents for performing open vocabulary tasks. However, planning for these tasks in the presence of uncertainties is challenging as it requires \enquote{chain-of-thought} reasoning, aggregating inform… ▽ More Designing robotic agents to perform open vocabulary tasks has been the long-standing goal in robotics and AI. Recently, Large Language Models (LLMs) have achieved impressive results in creating robotic agents for performing open vocabulary tasks. However, planning for these tasks in the presence of uncertainties is challenging as it requires \enquote{chain-of-thought} reasoning, aggregating information from the environment, updating state estimates, and generating actions based on the updated state estimates. In this paper, we present an interactive planning technique for partially observable tasks using LLMs. In the proposed method, an LLM is used to collect missing information from the environment using a robot and infer the state of the underlying problem from collected observations while guiding the robot to perform the required actions. We also use a fine-tuned Llama 2 model via self-instruct and compare its performance against a pre-trained LLM like GPT-4. Results are demonstrated on several tasks in simulation as well as real-world environments. A video describing our work along with some results could be found here. △ Less

Submitted 11 December, 2023; originally announced December 2023.

Comments: 22 pages, 4 figures

arXiv:2311.07499 [pdf, other]

Bridging the Sim-to-Real Gap with Dynamic Compliance Tuning for Industrial Insertion

Authors: Xiang Zhang, Masayoshi Tomizuka, Hui Li

Abstract: Contact-rich manipulation tasks often exhibit a large sim-to-real gap. For instance, industrial assembly tasks frequently involve tight insertions where the clearance is less than 0.1 mm and can even be negative when dealing with a deformable receptacle. This narrow clearance leads to complex contact dynamics that are difficult to model accurately in simulation, making it challenging to transfer s… ▽ More Contact-rich manipulation tasks often exhibit a large sim-to-real gap. For instance, industrial assembly tasks frequently involve tight insertions where the clearance is less than 0.1 mm and can even be negative when dealing with a deformable receptacle. This narrow clearance leads to complex contact dynamics that are difficult to model accurately in simulation, making it challenging to transfer simulation-learned policies to real-world robots. In this paper, we propose a novel framework for robustly learning manipulation skills for real-world tasks using simulated data only. Our framework consists of two main components: the "Force Planner" and the "Gain Tuner". The Force Planner plans both the robot motion and desired contact force, while the Gain Tuner dynamically adjusts the compliance control gains to track the desired contact force during task execution. The key insight is that by dynamically adjusting the robot's compliance control gains during task execution, we can modulate contact force in the new environment, thereby generating trajectories similar to those trained in simulation and narrowing the sim-to-real gap. Experimental results show that our method, trained in simulation on a generic square peg-and-hole task, can generalize to a variety of real-world insertion tasks involving narrow and negative clearances, all without requiring any fine-tuning. Videos are available at https://dynamic-compliance.github.io. △ Less

Submitted 1 March, 2024; v1 submitted 13 November, 2023; originally announced November 2023.

Comments: Accepted by ICRA 24

arXiv:2311.02527 [pdf, other]

Nonlinear Modeling for Soft Pneumatic Actuators via Data-Driven Parameter Estimation

Authors: Wu-Te Yang, Hannah Stuart, Burak Kurkcu, Masayoshi Tomizuka

Abstract: Precise modeling soft robots remains a challenge due to their infinite-dimensional nature governed by partial differential equations. This paper introduces an innovative approach for modeling soft pneumatic actuators, employing a nonlinear framework through data-driven parameter estimation. The research begins by introducing Ludwick's Law, providing a accurate representation of the large deflectio… ▽ More Precise modeling soft robots remains a challenge due to their infinite-dimensional nature governed by partial differential equations. This paper introduces an innovative approach for modeling soft pneumatic actuators, employing a nonlinear framework through data-driven parameter estimation. The research begins by introducing Ludwick's Law, providing a accurate representation of the large deflections exhibited by soft materials. Three key material properties, namely Young's modulus, tensile stress, and mixed viscosity, are utilized to estimate the parameters inside the nonlinear model using the least squares method. Subsequently, a nonlinear dynamic model for soft actuators is constructed by applying Ludwick's Law. To validate the accuracy and effectiveness of the proposed method, several experiments are performed demonstrating the model's capabilities in predicting the dynamic behavior of soft pneumatic actuators. In conclusion, this work contributes to the advancement of soft pneumatic actuator modeling that represents their nonlinear behavior. △ Less

Submitted 14 February, 2024; v1 submitted 4 November, 2023; originally announced November 2023.

Comments: 7 pages, 8 figures

arXiv:2310.13168 [pdf, other]

Optimized Design of a Soft Actuator Considering Force/Torque, Bendability, and Controllability via an Approximated Structure

Authors: Wu-Te Yang, Burak Kurkcu, Masayoshi Tomizuka

Abstract: This paper introduces a novel design method that enhances the force/torque, bendability, and controllability of soft pneumatic actuators (SPAs). The complex structure of the soft actuator is simplified by approximating it as a cantilever beam. This allows us to derive approximated nonlinear kinematic models and a dynamical model, which is explored to understand the correlation between natural freq… ▽ More This paper introduces a novel design method that enhances the force/torque, bendability, and controllability of soft pneumatic actuators (SPAs). The complex structure of the soft actuator is simplified by approximating it as a cantilever beam. This allows us to derive approximated nonlinear kinematic models and a dynamical model, which is explored to understand the correlation between natural frequency and dimensional parameters of SPA. The design problem is then transformed into an optimization problem, using kinematic equations as the objective function and the dynamical equation as a constraint. By solving this optimization problem, the optimal dimensional parameters are determined. Six prototypes are manufactured to validate the proposed approach. The optimal actuator successfully generates the desired force/torque and bending angle, while its natural frequency remains within the constrained range. This work highlights the potential of using optimization formulation and approximated nonlinear models to boost the performance and dynamical properties of soft pneumatic actuators. △ Less

Submitted 12 April, 2024; v1 submitted 19 October, 2023; originally announced October 2023.

Comments: 18 pages, 11 figures

arXiv:2310.10509 [pdf, other]

Efficient Sim-to-real Transfer of Contact-Rich Manipulation Skills with Online Admittance Residual Learning

Authors: Xiang Zhang, Changhao Wang, Lingfeng Sun, Zheng Wu, Xinghao Zhu, Masayoshi Tomizuka

Abstract: Learning contact-rich manipulation skills is essential. Such skills require the robots to interact with the environment with feasible manipulation trajectories and suitable compliance control parameters to enable safe and stable contact. However, learning these skills is challenging due to data inefficiency in the real world and the sim-to-real gap in simulation. In this paper, we introduce a hybr… ▽ More Learning contact-rich manipulation skills is essential. Such skills require the robots to interact with the environment with feasible manipulation trajectories and suitable compliance control parameters to enable safe and stable contact. However, learning these skills is challenging due to data inefficiency in the real world and the sim-to-real gap in simulation. In this paper, we introduce a hybrid offline-online framework to learn robust manipulation skills. We employ model-free reinforcement learning for the offline phase to obtain the robot motion and compliance control parameters in simulation \RV{with domain randomization}. Subsequently, in the online phase, we learn the residual of the compliance control parameters to maximize robot performance-related criteria with force sensor measurements in real time. To demonstrate the effectiveness and robustness of our approach, we provide comparative results against existing methods for assembly, pivoting, and screwing tasks. △ Less

Submitted 16 October, 2023; originally announced October 2023.

Comments: Conference on Robot Learning (CoRL) 2023

arXiv:2310.09899 [pdf, other]

Generalizable whole-body global manipulation of deformable linear objects by dual-arm robot in 3-D constrained environments

Authors: Mingrui Yu, Kangchen Lv, Changhao Wang, Yongpeng Jiang, Masayoshi Tomizuka, Xiang Li

Abstract: Constrained environments are common in practical applications of manipulating deformable linear objects (DLOs), where movements of both DLOs and robots should be constrained. This task is high-dimensional and highly constrained owing to the highly deformable DLOs, dual-arm robots with high degrees of freedom, and 3-D complex environments, which render global planning challenging. Furthermore, accu… ▽ More Constrained environments are common in practical applications of manipulating deformable linear objects (DLOs), where movements of both DLOs and robots should be constrained. This task is high-dimensional and highly constrained owing to the highly deformable DLOs, dual-arm robots with high degrees of freedom, and 3-D complex environments, which render global planning challenging. Furthermore, accurate DLO models needed by planning are often unavailable owing to their strong nonlinearity and diversity, resulting in unreliable planned paths. This article focuses on the global moving and sha** of DLOs in constrained environments by dual-arm robots. The main objectives are 1) to efficiently and accurately accomplish this task, and 2) to achieve generalizable and robust manipulation of various DLOs. To this end, we propose a complementary framework with whole-body planning and control using appropriate DLO model representations. First, a global planner is proposed to efficiently find feasible solutions based on a simplified DLO energy model, which considers the full system states and all constraints to plan more reliable paths. Then, a closed-loop manipulation scheme is proposed to compensate for the modeling errors and enhance the robustness and accuracy, which incorporates a model predictive controller that real-time adjusts the robot motion based on an adaptive DLO motion model. The key novelty is that our framework can efficiently solve the high-dimensional problem subject to multiple constraints and generalize to various DLOs without elaborate model identifications. Experiments demonstrate that our framework can accomplish considerably more complicated tasks than existing works, with significantly higher efficiency, generalizability, and reliability. △ Less

Submitted 15 October, 2023; originally announced October 2023.

Comments: Project website: https://mingrui-yu.github.io/DLO_planning_2

arXiv:2310.08864 [pdf, other]

Open X-Embodiment: Robotic Learning Datasets and RT-X Models

Authors: Open X-Embodiment Collaboration, Abby O'Neill, Abdul Rehman, Abhinav Gupta, Abhiram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, Ajay Mandlekar, A**kya Jain, Albert Tung, Alex Bewley, Alex Herzog, Alex Irpan, Alexander Khazatsky, Anant Rai, Anchit Gupta, Andrew Wang, Andrey Kolobov, Anikait Singh, Animesh Garg, Aniruddha Kembhavi, Annie Xie , et al. (267 additional authors not shown)

Abstract: Large, high-capacity models trained on diverse datasets have shown remarkable successes on efficiently tackling downstream applications. In domains from NLP to Computer Vision, this has led to a consolidation of pretrained models, with general pretrained backbones serving as a starting point for many applications. Can such a consolidation happen in robotics? Conventionally, robotic learning method… ▽ More Large, high-capacity models trained on diverse datasets have shown remarkable successes on efficiently tackling downstream applications. In domains from NLP to Computer Vision, this has led to a consolidation of pretrained models, with general pretrained backbones serving as a starting point for many applications. Can such a consolidation happen in robotics? Conventionally, robotic learning methods train a separate model for every application, every robot, and even every environment. Can we instead train generalist X-robot policy that can be adapted efficiently to new robots, tasks, and environments? In this paper, we provide datasets in standardized data formats and models to make it possible to explore this possibility in the context of robotic manipulation, alongside experimental results that provide an example of effective X-robot policies. We assemble a dataset from 22 different robots collected through a collaboration between 21 institutions, demonstrating 527 skills (160266 tasks). We show that a high-capacity model trained on this data, which we call RT-X, exhibits positive transfer and improves the capabilities of multiple robots by leveraging experience from other platforms. More details can be found on the project website https://robotics-transformer-x.github.io. △ Less

Submitted 1 June, 2024; v1 submitted 13 October, 2023; originally announced October 2023.

Comments: Project website: https://robotics-transformer-x.github.io

arXiv:2310.07932 [pdf, other]

What Matters to You? Towards Visual Representation Alignment for Robot Learning

Authors: Ran Tian, Chenfeng Xu, Masayoshi Tomizuka, Jitendra Malik, Andrea Bajcsy

Abstract: When operating in service of people, robots need to optimize rewards aligned with end-user preferences. Since robots will rely on raw perceptual inputs like RGB images, their rewards will inevitably use visual representations. Recently there has been excitement in using representations from pre-trained visual models, but key to making these work in robotics is fine-tuning, which is typically done… ▽ More When operating in service of people, robots need to optimize rewards aligned with end-user preferences. Since robots will rely on raw perceptual inputs like RGB images, their rewards will inevitably use visual representations. Recently there has been excitement in using representations from pre-trained visual models, but key to making these work in robotics is fine-tuning, which is typically done via proxy tasks like dynamics prediction or enforcing temporal cycle-consistency. However, all these proxy tasks bypass the human's input on what matters to them, exacerbating spurious correlations and ultimately leading to robot behaviors that are misaligned with user preferences. In this work, we propose that robots should leverage human feedback to align their visual representations with the end-user and disentangle what matters for the task. We propose Representation-Aligned Preference-based Learning (RAPL), a method for solving the visual representation alignment problem and visual reward learning problem through the lens of preference-based learning and optimal transport. Across experiments in X-MAGICAL and in robotic manipulation, we find that RAPL's reward consistently generates preferred robot behaviors with high sample efficiency, and shows strong zero-shot generalization when the visual representation is learned from a different embodiment than the robot's. △ Less

Submitted 15 January, 2024; v1 submitted 11 October, 2023; originally announced October 2023.

arXiv:2310.07218 [pdf, other]

Quantifying Agent Interaction in Multi-agent Reinforcement Learning for Cost-efficient Generalization

Authors: Yuxin Chen, Chen Tang, Ran Tian, Chenran Li, **ning Li, Masayoshi Tomizuka, Wei Zhan

Abstract: Generalization poses a significant challenge in Multi-agent Reinforcement Learning (MARL). The extent to which an agent is influenced by unseen co-players depends on the agent's policy and the specific scenario. A quantitative examination of this relationship sheds light on effectively training agents for diverse scenarios. In this study, we present the Level of Influence (LoI), a metric quantifyi… ▽ More Generalization poses a significant challenge in Multi-agent Reinforcement Learning (MARL). The extent to which an agent is influenced by unseen co-players depends on the agent's policy and the specific scenario. A quantitative examination of this relationship sheds light on effectively training agents for diverse scenarios. In this study, we present the Level of Influence (LoI), a metric quantifying the interaction intensity among agents within a given scenario and environment. We observe that, generally, a more diverse set of co-play agents during training enhances the generalization performance of the ego agent; however, this improvement varies across distinct scenarios and environments. LoI proves effective in predicting these improvement disparities within specific scenarios. Furthermore, we introduce a LoI-guided resource allocation method tailored to train a set of policies for diverse scenarios under a constrained budget. Our results demonstrate that strategic resource allocation based on LoI can achieve higher performance than uniform allocation under the same computation budget. △ Less

Submitted 11 October, 2023; originally announced October 2023.

Comments: 12 pages, 6 figures

ACM Class: I.2.6

arXiv:2310.03888 [pdf, other]

Frequency Domain Analysis of Nonlinear Series Elastic Actuator via Describing Function

Authors: Motohiro Hirao, Burak Kurkcu, Alireza Ghanbarpour, Masayoshi Tomizuka

Abstract: Nonlinear stiffness SEAs (NSEAs) inspired by biological muscles offer promise in achieving adaptable stiffness for assistive robots. While assistive robots are often designed and compared based on torque capability and control bandwidth, NSEAs have not been systematically designed in the frequency domain due to their nonlinearity. The describing function, an analytical concept for nonlinear system… ▽ More Nonlinear stiffness SEAs (NSEAs) inspired by biological muscles offer promise in achieving adaptable stiffness for assistive robots. While assistive robots are often designed and compared based on torque capability and control bandwidth, NSEAs have not been systematically designed in the frequency domain due to their nonlinearity. The describing function, an analytical concept for nonlinear systems, offers a means to understand their behavior in the frequency domain. This paper introduces a frequency domain analysis of nonlinear series elastic actuators using the describing function method. This framework aims to equip researchers and engineers with tools for improved design and control in assistive robotics. △ Less

Submitted 24 October, 2023; v1 submitted 5 October, 2023; originally announced October 2023.

Comments: accepted by 2023 IEEE ROBIO conference

arXiv:2310.03026 [pdf, other]

LanguageMPC: Large Language Models as Decision Makers for Autonomous Driving

Authors: Hao Sha, Yao Mu, Yuxuan Jiang, Li Chen, Chenfeng Xu, ** Luo, Shengbo Eben Li, Masayoshi Tomizuka, Wei Zhan, Mingyu Ding

Abstract: Existing learning-based autonomous driving (AD) systems face challenges in comprehending high-level information, generalizing to rare events, and providing interpretability. To address these problems, this work employs Large Language Models (LLMs) as a decision-making component for complex AD scenarios that require human commonsense understanding. We devise cognitive pathways to enable comprehensi… ▽ More Existing learning-based autonomous driving (AD) systems face challenges in comprehending high-level information, generalizing to rare events, and providing interpretability. To address these problems, this work employs Large Language Models (LLMs) as a decision-making component for complex AD scenarios that require human commonsense understanding. We devise cognitive pathways to enable comprehensive reasoning with LLMs, and develop algorithms for translating LLM decisions into actionable driving commands. Through this approach, LLM decisions are seamlessly integrated with low-level controllers by guided parameter matrix adaptation. Extensive experiments demonstrate that our proposed method not only consistently surpasses baseline approaches in single-vehicle tasks, but also helps handle complex driving behaviors even multi-vehicle coordination, thanks to the commonsense reasoning capabilities of LLMs. This paper presents an initial step toward leveraging LLMs as effective decision-makers for intricate AD scenarios in terms of safety, efficiency, generalizability, and interoperability. We aspire for it to serve as inspiration for future research in this field. Project page: https://sites.google.com/view/llm-mpc △ Less

Submitted 13 October, 2023; v1 submitted 4 October, 2023; originally announced October 2023.

arXiv:2310.03023 [pdf, other]

Human-oriented Representation Learning for Robotic Manipulation

Authors: Mingxiao Huo, Mingyu Ding, Chenfeng Xu, Thomas Tian, Xinghao Zhu, Yao Mu, Lingfeng Sun, Masayoshi Tomizuka, Wei Zhan

Abstract: Humans inherently possess generalizable visual representations that empower them to efficiently explore and interact with the environments in manipulation tasks. We advocate that such a representation automatically arises from simultaneously learning about multiple simple perceptual skills that are critical for everyday scenarios (e.g., hand detection, state estimate, etc.) and is better suited fo… ▽ More Humans inherently possess generalizable visual representations that empower them to efficiently explore and interact with the environments in manipulation tasks. We advocate that such a representation automatically arises from simultaneously learning about multiple simple perceptual skills that are critical for everyday scenarios (e.g., hand detection, state estimate, etc.) and is better suited for learning robot manipulation policies compared to current state-of-the-art visual representations purely based on self-supervised objectives. We formalize this idea through the lens of human-oriented multi-task fine-tuning on top of pre-trained visual encoders, where each task is a perceptual skill tied to human-environment interactions. We introduce Task Fusion Decoder as a plug-and-play embedding translator that utilizes the underlying relationships among these perceptual skills to guide the representation learning towards encoding meaningful structure for what's important for all perceptual skills, ultimately empowering learning of downstream robotic manipulation tasks. Extensive experiments across a range of robotic tasks and embodiments, in both simulations and real-world environments, show that our Task Fusion Decoder consistently improves the representation of three state-of-the-art visual encoders including R3M, MVP, and EgoVLP, for downstream manipulation policy-learning. Project page: https://sites.google.com/view/human-oriented-robot-learning △ Less

Submitted 4 October, 2023; originally announced October 2023.

arXiv:2310.02648 [pdf, other]

Long-Term Dynamic Window Approach for Kinodynamic Local Planning in Static and Crowd Environments

Authors: Zhiqiang Jian, Songyi Zhang, Lingfeng Sun, Wei Zhan, Nanning Zheng, Masayoshi Tomizuka

Abstract: Local planning for a differential wheeled robot is designed to generate kinodynamic feasible actions that guide the robot to a goal position along the navigation path while avoiding obstacles. Reactive, predictive, and learning-based methods are widely used in local planning. However, few of them can fit static and crowd environments while satisfying kinodynamic constraints simultaneously. To solv… ▽ More Local planning for a differential wheeled robot is designed to generate kinodynamic feasible actions that guide the robot to a goal position along the navigation path while avoiding obstacles. Reactive, predictive, and learning-based methods are widely used in local planning. However, few of them can fit static and crowd environments while satisfying kinodynamic constraints simultaneously. To solve this problem, we propose a novel local planning method. The method applies a long-term dynamic window approach to generate an initial trajectory and then optimizes it with graph optimization. The method can plan actions under the robot's kinodynamic constraints in real time while allowing the generated actions to be safer and more jitterless. Experimental results show that the proposed method adapts well to crowd and static environments and outperforms most SOTA approaches. △ Less

Submitted 4 October, 2023; originally announced October 2023.

Comments: 9 pages, 7 figures

Journal ref: 2023 IEEE RA-L

arXiv:2310.02625 [pdf, other]

Adaptive Spatio-Temporal Voxels Based Trajectory Planning for Autonomous Driving in Highway Traffic Flow

Authors: Zhiqiang Jian, Songyi Zhang, Lingfeng Sun, Wei Zhan, Masayoshi Tomizuka, Nanning Zheng

Abstract: Trajectory planning is crucial for the safe driving of autonomous vehicles in highway traffic flow. Currently, some advanced trajectory planning methods utilize spatio-temporal voxels to construct feasible regions and then convert trajectory planning into optimization problem solving based on the feasible regions. However, these feasible region construction methods cannot adapt to the changes in d… ▽ More Trajectory planning is crucial for the safe driving of autonomous vehicles in highway traffic flow. Currently, some advanced trajectory planning methods utilize spatio-temporal voxels to construct feasible regions and then convert trajectory planning into optimization problem solving based on the feasible regions. However, these feasible region construction methods cannot adapt to the changes in dynamic environments, making them difficult to apply in complex traffic flow. In this paper, we propose a trajectory planning method based on adaptive spatio-temporal voxels which improves the construction of feasible regions and trajectory optimization while maintaining the quadratic programming form. The method can adjust feasible regions and trajectory planning according to real-time traffic flow and environmental changes, realizing vehicles to drive safely in complex traffic flow. The proposed method has been tested in both open-loop and closed-loop environments, and the test results show that our method outperforms the current planning methods. △ Less

Submitted 4 October, 2023; originally announced October 2023.

Comments: 8 pages, 5 figures

Journal ref: IEEE ITSC 2023

arXiv:2310.02264 [pdf, other]

Generalizable Long-Horizon Manipulations with Large Language Models

Authors: Haoyu Zhou, Mingyu Ding, Weikun Peng, Masayoshi Tomizuka, Lin Shao, Chuang Gan

Abstract: This work introduces a framework harnessing the capabilities of Large Language Models (LLMs) to generate primitive task conditions for generalizable long-horizon manipulations with novel objects and unseen tasks. These task conditions serve as guides for the generation and adjustment of Dynamic Movement Primitives (DMP) trajectories for long-horizon task execution. We further create a challenging… ▽ More This work introduces a framework harnessing the capabilities of Large Language Models (LLMs) to generate primitive task conditions for generalizable long-horizon manipulations with novel objects and unseen tasks. These task conditions serve as guides for the generation and adjustment of Dynamic Movement Primitives (DMP) trajectories for long-horizon task execution. We further create a challenging robotic manipulation task suite based on Pybullet for long-horizon task evaluation. Extensive experiments in both simulated and real-world environments demonstrate the effectiveness of our framework on both familiar tasks involving new objects and novel but related tasks, highlighting the potential of LLMs in enhancing robotic system versatility and adaptability. Project website: https://object814.github.io/Task-Condition-With-LLM/ △ Less

Submitted 3 October, 2023; originally announced October 2023.

arXiv:2310.02262 [pdf, other]

RSRD: A Road Surface Reconstruction Dataset and Benchmark for Safe and Comfortable Autonomous Driving

Authors: Tong Zhao, Chenfeng Xu, Mingyu Ding, Masayoshi Tomizuka, Wei Zhan, Yintao Wei

Abstract: This paper addresses the growing demands for safety and comfort in intelligent robot systems, particularly autonomous vehicles, where road conditions play a pivotal role in overall driving performance. For example, reconstructing road surfaces helps to enhance the analysis and prediction of vehicle responses for motion planning and control systems. We introduce the Road Surface Reconstruction Data… ▽ More This paper addresses the growing demands for safety and comfort in intelligent robot systems, particularly autonomous vehicles, where road conditions play a pivotal role in overall driving performance. For example, reconstructing road surfaces helps to enhance the analysis and prediction of vehicle responses for motion planning and control systems. We introduce the Road Surface Reconstruction Dataset (RSRD), a real-world, high-resolution, and high-precision dataset collected with a specialized platform in diverse driving conditions. It covers common road types containing approximately 16,000 pairs of stereo images, original point clouds, and ground-truth depth/disparity maps, with accurate post-processing pipelines to ensure its quality. Based on RSRD, we further build a comprehensive benchmark for recovering road profiles through depth estimation and stereo matching. Preliminary evaluations with various state-of-the-art methods reveal the effectiveness of our dataset and the challenge of the task, underscoring substantial opportunities of RSRD as a valuable resource for advancing techniques, e.g., multi-view stereo towards safe autonomous driving. The dataset and demo videos are available at https://thu-rsxd.com/rsrd/ △ Less

Submitted 3 October, 2023; originally announced October 2023.

arXiv:2310.01740 [pdf, other]

Control of Soft Pneumatic Actuators with Approximated Dynamical Modeling

Authors: Wu-Te Yang, Burak Kurkcu, Motohiro Hirao, Lingfeng Sun, Xinghao Zhu, Zhizhou Zhang, Grace X. Gu, Masayoshi Tomizuka

Abstract: This paper introduces a full system modeling strategy for a syringe pump and soft pneumatic actuators(SPAs). The soft actuator is conceptualized as a beam structure, utilizing a second-order bending model. The equation of natural frequency is derived from Euler's bending theory, while the dam** ratio is estimated by fitting step responses of soft pneumatic actuators. Evaluation of model uncertai… ▽ More This paper introduces a full system modeling strategy for a syringe pump and soft pneumatic actuators(SPAs). The soft actuator is conceptualized as a beam structure, utilizing a second-order bending model. The equation of natural frequency is derived from Euler's bending theory, while the dam** ratio is estimated by fitting step responses of soft pneumatic actuators. Evaluation of model uncertainty underscores the robustness of our modeling methodology. To validate our approach, we deploy it across four prototypes varying in dimensional parameters. Furthermore, a syringe pump is designed to drive the actuator, and a pressure model is proposed to construct a full system model. By employing this full system model, the Linear-Quadratic Regulator (LQR) controller is implemented to control the soft actuator, achieving high-speed responses and high accuracy in both step response and square wave function response tests. Both the modeling method and the LQR controller are thoroughly evaluated through experiments. Lastly, a gripper, consisting of two actuators with a feedback controller, demonstrates stable gras** of delicate objects with a significantly higher success rate. △ Less

Submitted 19 October, 2023; v1 submitted 2 October, 2023; originally announced October 2023.

Comments: 8 pages, 10 figures, accepted by 2023 IEEE ROBIO conference

arXiv:2310.01614 [pdf, other]

Distributed Multi-agent Interaction Generation with Imagined Potential Games

Authors: Lingfeng Sun, Pin-Yun Hung, Changhao Wang, Masayoshi Tomizuka, Zhuo Xu

Abstract: Interactive behavior modeling of multiple agents is an essential challenge in simulation, especially in scenarios when agents need to avoid collisions and cooperate at the same time. Humans can interact with others without explicit communication and navigate in scenarios when cooperation is required. In this work, we aim to model human interactions in this realistic setting, where each agent acts… ▽ More Interactive behavior modeling of multiple agents is an essential challenge in simulation, especially in scenarios when agents need to avoid collisions and cooperate at the same time. Humans can interact with others without explicit communication and navigate in scenarios when cooperation is required. In this work, we aim to model human interactions in this realistic setting, where each agent acts based on its observation and does not communicate with others. We propose a framework based on distributed potential games, where each agent imagines a cooperative game with other agents and solves the game using its estimation of their behavior. We utilize iLQR to solve the games and closed-loop simulate the interactions. We demonstrate the benefits of utilizing distributed imagined games in our framework through various simulation experiments. We show the high success rate, the increased navigation efficiency, and the ability to generate rich and realistic interactions with interpretable parameters. Illustrative examples are available at https://sites.google.com/berkeley.edu/distributed-interaction. △ Less

Submitted 2 October, 2023; originally announced October 2023.

Comments: 8 pages, 7 figures

arXiv:2309.17342 [pdf, other]

Towards Free Data Selection with General-Purpose Models

Authors: Yichen Xie, Mingyu Ding, Masayoshi Tomizuka, Wei Zhan

Abstract: A desirable data selection algorithm can efficiently choose the most informative samples to maximize the utility of limited annotation budgets. However, current approaches, represented by active learning methods, typically follow a cumbersome pipeline that iterates the time-consuming model training and batch data selection repeatedly. In this paper, we challenge this status quo by designing a dist… ▽ More A desirable data selection algorithm can efficiently choose the most informative samples to maximize the utility of limited annotation budgets. However, current approaches, represented by active learning methods, typically follow a cumbersome pipeline that iterates the time-consuming model training and batch data selection repeatedly. In this paper, we challenge this status quo by designing a distinct data selection pipeline that utilizes existing general-purpose models to select data from various datasets with a single-pass inference without the need for additional training or supervision. A novel free data selection (FreeSel) method is proposed following this new pipeline. Specifically, we define semantic patterns extracted from inter-mediate features of the general-purpose model to capture subtle local information in each image. We then enable the selection of all data samples in a single pass through distance-based sampling at the fine-grained semantic pattern level. FreeSel bypasses the heavy batch selection process, achieving a significant improvement in efficiency and being 530x faster than existing active learning methods. Extensive experiments verify the effectiveness of FreeSel on various computer vision tasks. Our code is available at https://github.com/yichen928/FreeSel. △ Less

Submitted 14 October, 2023; v1 submitted 29 September, 2023; originally announced September 2023.

Comments: accepted by NeurIPS 2023

arXiv:2309.10121 [pdf, other]

Pre-training on Synthetic Driving Data for Trajectory Prediction

Authors: Yiheng Li, Seth Z. Zhao, Chenfeng Xu, Chen Tang, Chenran Li, Mingyu Ding, Masayoshi Tomizuka, Wei Zhan

Abstract: Accumulating substantial volumes of real-world driving data proves pivotal in the realm of trajectory forecasting for autonomous driving. Given the heavy reliance of current trajectory forecasting models on data-driven methodologies, we aim to tackle the challenge of learning general trajectory forecasting representations under limited data availability. We propose to augment both HD maps and traj… ▽ More Accumulating substantial volumes of real-world driving data proves pivotal in the realm of trajectory forecasting for autonomous driving. Given the heavy reliance of current trajectory forecasting models on data-driven methodologies, we aim to tackle the challenge of learning general trajectory forecasting representations under limited data availability. We propose to augment both HD maps and trajectories and apply pre-training strategies on top of them. Specifically, we take advantage of graph representations of HD-map and apply vector transformations to reshape the maps, to easily enrich the limited number of scenes. Additionally, we employ a rule-based model to generate trajectories based on augmented scenes; thus enlarging the trajectories beyond the collected real ones. To foster the learning of general representations within this augmented dataset, we comprehensively explore the different pre-training strategies, including extending the concept of a Masked AutoEncoder (MAE) for trajectory forecasting. Extensive experiments demonstrate the effectiveness of our data expansion and pre-training strategies, which outperform the baseline prediction model by large margins, e.g. 5.04%, 3.84% and 8.30% in terms of $MR_6$, $minADE_6$ and $minFDE_6$. △ Less

Submitted 19 September, 2023; v1 submitted 18 September, 2023; originally announced September 2023.

Showing 1–50 of 208 results for author: Tomizuka, M