RoboUniView: Visual-Language Model with Unified View Representation for Robotic Manipulation

Liu, Fanfan; Yan, Feng; Zheng, Liming; Feng, Chengjian; Huang, Yiyang; Ma, Lin

arXiv:2406.18977 (cs)

[Submitted on 27 Jun 2024]

Abstract: Utilizing Vision-Language Models (VLMs) for robotic manipulation represents a novel paradigm, aiming to enhance the model's ability to generalize to new objects and instructions. However, due to variations in camera specifications and mounting positions, existing methods exhibit significant performance disparities across different robotic platforms. To address this challenge, we propose RoboUniView in this paper, an innovative approach that decouples visual feature extraction from action learning. We first learn a unified view representation from multi-perspective views by pre-training on readily accessible data, and then derive actions from this unified view representation to control robotic manipulation. This unified view representation more accurately mirrors the physical world and is not constrained by the robotic platform's camera parameters. Thanks to this methodology, we achieve state-of-the-art performance on the demanding CALVIN benchmark, enhancing the success rate in the $D \to D$ setting from 88.7% to 96.2%, and in the $ABC \to D$ setting from 82.4% to 94.2%. Moreover, our model exhibits outstanding adaptability and flexibility: it maintains high performance under unseen camera parameters, can utilize multiple datasets with varying camera parameters, and is capable of joint cross-task learning across datasets. Code is provided for re-implementation. this https URL

Subjects:	Robotics (cs.RO); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2406.18977 [cs.RO]
	(or arXiv:2406.18977v1 [cs.RO] for this version)
	https://doi.org/10.48550/arXiv.2406.18977

Submission history

From: Fanfan Liu [view email]
[v1] Thu, 27 Jun 2024 08:13:33 UTC (5,803 KB)
