
License: arXiv.org perpetual non-exclusive license
arXiv:2401.11432v1 [cs.RO] 21 Jan 2024

Bimanual Deformable Bag Manipulation Using a Structure-of-Interest Based Latent Dynamics Model

Peng Zhou$^{1}$, Pai Zheng$^{2}$, Jiaming Qi$^{1}$, Chenxi Li$^{2}$, Chenguang Yang$^{3}$, David Navarro-Alarcon$^{2}$, and Jia Pan$^{1}$

This work is supported by the Innovation and Technology Commission of the HKSAR Government under the InnoHK initiative. (Corresponding author: Jia Pan.)
$^{1}$The authors are with The University of Hong Kong, HK, Hong Kong. {jeffzhou,tomqi,jpan}@hku.hk
$^{2}$The authors are with The Hong Kong Polytechnic University, KLN, Hong Kong. {pai.zheng,dnavar}@polyu.edu.hk, [email protected]
$^{3}$The author is with University of Liverpool, England, UK. [email protected]

Abstract

The manipulation of deformable objects by robotic systems presents a significant challenge due to their complex and infinite-dimensional configuration spaces. This paper introduces a novel approach to Deformable Object Manipulation (DOM) by emphasizing the identification and manipulation of Structures of Interest (SOIs) in deformable fabric bags. We propose a bimanual manipulation framework that leverages a Graph Neural Network (GNN)-based latent dynamics model to succinctly represent and predict the behavior of these SOIs. Our approach involves constructing a graph representation from partial point cloud data of the object and learning the latent dynamics model that effectively captures the essential deformations of the fabric bag within a reduced computational space. By integrating this latent dynamics model with Model Predictive Control (MPC), we empower robotic manipulators to perform precise and stable manipulation tasks focused on the SOIs. We have validated our framework through various empirical experiments demonstrating its efficacy in bimanual manipulation of fabric bags. Our contributions not only address the complexities inherent in DOM but also provide new perspectives and methodologies for enhancing robotic interactions with deformable objects by concentrating on their critical structural elements. Experimental videos can be obtained from https://sites.google.com/view/bagbot

Index Terms:
Deformable object manipulation, structure of interest, latent dynamics model, bimanual manipulation.

I Introduction

Deformable object manipulation (DOM) [1, 2, 3] is a fundamental capability for robots to meaningfully interact with the physical world and assist in various human tasks. However, the manipulation of deformable objects such as cloth [4], rope [5], and food ingredients [6] is particularly challenging due to their infinite-dimensional configuration space and complex dynamics. Traditional methods in DOM often resort to simplified physics models or data-driven modeling with handcrafted features, which lack adaptability to the varied shapes and dynamics of these objects [7]. Moreover, most current DOM works focus on manipulating the entire object, neglecting the critical structures, i.e., Structures of Interest (SOIs), that are essential for subsequent manipulation steps.

Figure 1: (Left) SOI examples for different deformable object manipulation tasks, e.g., garment hanging, robot-assistive dressing. (Right) Conceptual representation of the manifold with boundary. The manifold encompasses $\operatorname{Int}\mathcal{M}$ and $\partial\mathcal{M}$, where the local neighborhoods of points in $\operatorname{Int}\mathcal{M}$ and $\partial\mathcal{M}$ are homeomorphically equivalent to $\operatorname{Int}\mathbb{H}^{n}$ and $\partial\mathbb{H}^{n}$.
Figure 2: The two robots grasp two handles of a fabric bag to manipulate the SOI (i.e., the opening rim) into the target configuration.

In this paper, we introduce the concept of SOI into the realm of DOM (see Fig. 1 for an example), a paradigm shift that emphasizes the importance of identifying and manipulating key structural components rather than the entire object. This focus on the SOI is motivated by the observation that successful DOM tasks typically involve the manipulation of these key areas. By targeting these SOIs, we can reduce the computational load significantly, as modeling the complete 3D dynamics of the deformable object is unnecessary and burdensome for the task at hand.

Furthermore, this work is pioneering in considering the bimanual manipulation of a deformable fabric bag as shown in Fig. 2. By proposing a novel bimanual manipulation framework using a GNN-based latent dynamics model, we address the complexities of DOM with a focus on SOIs. Our approach is designed to extract the SOI from the observed object point cloud, construct a graph representation, and learn the latent dynamics model that effectively captures the object’s deformations within a compact space. Integrating this model with model predictive control (MPC), we enable robots to achieve accurate and stable manipulation of deformable bags.

The main contributions of this work include:

  • Pioneering the bimanual manipulation of deformable fabric bags by focusing on SOIs.

  • Introducing the SOI concept for representing deformable object states, which emphasizes the manipulation of critical structures.

  • Designing a GNN-based method to learn a latent dynamics model from partial point cloud data, particularly focusing on the SOIs.

  • Implementing MPC based on the latent dynamics for generating optimal manipulation actions centered around SOIs.

Various empirical experiments on the bimanual manipulation of a fabric bag validate the efficacy of our proposed framework. Our work offers new insights and methodologies for improving robotic systems’ capability for intelligent deformable object manipulation, with a specialized emphasis on the pivotal SOIs.

Figure 3: Bimanual bag manipulation is formulated as a POMDP problem, where the SOI-related observation $\hat{o}_{t}$ is extracted from the original observation $o_{t}$ and governed by $f_{\theta_{\mathrm{dym}}}$.

II Related Work

Deformable object manipulation has been an active research area in robotics. Existing methods can be categorized into model-based and data-driven approaches. Model-based methods rely on simplified physics models to represent deformable objects. Early works used mass-spring models (MSM) to simulate deformation [8, 9]. The finite element method (FEM) provides more accurate modeling of continuum mechanics [10, 11, 12]. However, analytical models require extensive manual tuning and generalization across different materials or shapes remains difficult. Data-driven methods [7, 13, 14] aim to learn models directly from data. Vision-based methods extract geometric features from visual observations to infer deformations [15, 16]. Recent works utilized deep learning on point cloud data and achieved improved modeling accuracy [17]. However, they depend heavily on large labeled datasets. Self-supervised methods were proposed to learn from physical interactions [18, 19]. However, they focused on planar objects and could not handle complex deformations.

Our work is most closely related to robot manipulation using graph neural networks (GNNs). GNNs have shown promising results in learning physics simulations [20, 21, 22, 23, 24] and control policies [25, 26, 27, 28]. Recently, GNNs were introduced to model rope and cloth manipulation [29, 30, 31]. Different from these works, we propose the structure of interest concept to focus on key object components and employ GNNs to learn latent dynamics models for describing complex deformable object behaviors. The integration of the latent dynamics model with MPC also distinguishes our framework from prior art. In summary, our work aims to advance existing deformable object manipulation by introducing an SOI-based modeling approach. The adoption of state-of-the-art deep learning techniques allows better generalizability across different materials and shapes. The experiments on the bimanual manipulation of fabric bags represent challenging test cases and validate the feasibility of our framework.

III Problem Statement

Given that individual image and depth observations generally do not fully disclose the state of the environment, we approach the task of bimanual bag manipulation as a Partially Observable Markov Decision Process (POMDP), as depicted in Fig. 3. This is formally defined by the tuple $(\mathcal{S},\mathcal{A},\mathcal{T},\mathcal{O},\mathcal{B},\mathcal{R},\gamma)$, where the state at time $t$, denoted by $s_{t}$, belongs to the state space $\mathcal{S}$ and is not directly observable. The state encapsulates the configuration of the robots and the manipulated object. The corresponding observation at time $t$, denoted by $o_{t}$, is within the observation space $\mathcal{O}$. The state transition model $\mathcal{T}(s_{t+1}\mid s_{t},a_{t})$ describes the probability of transitioning from the current state $s_{t}$ to a new state $s_{t+1}$ upon taking an action $a_{t}$ from the action space $\mathcal{A}$, which consists of the combined left and right robotic actions, represented by the Cartesian product $\mathcal{A}=\mathcal{A}^{1}\times\mathcal{A}^{2}$. The function $\mathcal{B}(o_{t}\mid s_{t},a_{t-1})$ specifies the likelihood of observing $o_{t}$ after executing action $a_{t-1}$ and transitioning to state $s_{t}$. The reward function $\mathcal{R}(s_{t},a_{t})$ assigns a real-valued reward to each state-action pair, and the discount factor $\gamma\in[0,1)$ quantifies the preference for immediate rewards over future rewards.

The goal of this work is to use two 3D-printed robot grippers to grasp the bag’s handles and manipulate the opening rim into a target state $\mathbf{g}$. We assume this deformable bag manipulation task is quasi-static; dynamic manipulation motions are not considered. At time step $t$, the robots apply action $(\mathbf{a}^{1}_{t},\mathbf{a}^{2}_{t})\in\mathcal{A}$ to the bag, and we can partially observe the transition of the bag from $\mathbf{o}_{t}$ to $\mathbf{o}_{t+1}$ under the unknown state transition from $\mathbf{s}_{t}$ to $\mathbf{s}_{t+1}$. However, a complete observation of the bag is not necessary. In our task, the opening rim of the bag is critical because it not only determines the manipulation goal but also provides the most informative sensory feedback, such as visual landmarks, during manipulation. We define a Structure of Interest (SOI) in the context of Deformable Object Manipulation (DOM) as a specific region or feature of a deformable object that is critical for successful manipulation (see Fig. 1 for examples). Therefore, in this task, we take the opening rim of the manipulated bag as our SOI, and topologically we can define this loop-like structure as a manifold with boundary. As illustrated in Fig. 1, we also define its interior and boundary as $\operatorname{Int}\mathcal{M}$ and $\partial\mathcal{M}$, whose points’ neighborhoods are respectively homeomorphic to $\operatorname{Int}\mathbb{H}^{n}=\{(x_{1},\ldots,x_{n})\mid x_{n}>0\}$ and $\partial\mathbb{H}^{n}=\{(x_{1},\ldots,x_{n})\mid x_{n}=0\}$.

With an appropriate perception module, the observation of the Structure of Interest (SOI), denoted by $\hat{o}_{t}$, can be extracted from the overall observation of the bag $o_{t}$. Our approach is based on the insight that it is more efficacious to predict the dynamics of the SOI rather than the entire complex dynamics of the bag. To this end, we employ a Graph Neural Network (GNN) to establish a dynamics model $f_{\theta_{\text{dym}}}$ that is dedicated to learning the transition functions of the SOI, defined as $f_{\theta_{\text{dym}}}:\hat{\mathcal{O}}\times\mathcal{A}\rightarrow\hat{\mathcal{O}}$. This dynamics model accepts as input a sequence of SOI observations $\hat{\mathbf{o}}_{t-n,\ldots,t}\in\hat{\mathcal{O}}$ and actions $(\mathbf{a}^{1}_{t-n,\ldots,t},\mathbf{a}^{2}_{t-n,\ldots,t})\in\mathcal{A}$, and predicts the subsequent observation $\hat{\mathbf{o}}_{t+1}$, where $n$ represents the length of the observation history before the current time step $t$. With the dynamics model, we proceed to cast the bimanual manipulation of the bag as a task within the Model Predictive Control (MPC) framework. Within this MPC setup, the cost function $\mathcal{J}$ quantifies the difference between the final SOI feature points at time step $T$ and the targeted SOI state $\mathbf{g}$. Details on the precise structure of the cost function $\mathcal{J}$ are illustrated in the following section. This cost is minimized to yield an optimal sequence of actions across a temporal horizon of $T$ steps:

$$
\begin{aligned}
\langle\mathbf{a}^{0},~\mathbf{a}^{1}\rangle_{0:T-1} &= \underset{\langle\mathbf{a}^{0},\mathbf{a}^{1}\rangle_{0:T-1}\in\mathcal{A}}{\arg\min}\ \mathcal{J}(\hat{\mathbf{g}},\mathbf{g}) \\
\text{where}\quad\hat{\mathbf{g}} &= f_{\theta_{\text{dym}}}\left(\hat{\mathbf{o}}_{0},\langle\mathbf{a}^{0},\mathbf{a}^{1}\rangle_{0:T-1}\right)
\end{aligned}
\tag{1}
$$

where $\hat{\mathbf{g}}$ represents the predicted SOI state at time step $T$. Fig. 4 shows the overall framework of our proposed latent dynamics model for the bimanual bag manipulation task.
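
To make the optimization in Eq. (1) concrete, the following minimal sketch solves it by random shooting. The optimizer, the point-wise squared-error cost, the six-number action layout (two 3D gripper displacements), and the `toy_dynamics` stand-in for the learned model are all assumptions made for this illustration; the text above does not specify them.

```python
import random


def rollout(dynamics, obs, actions):
    """Apply a one-step model along an action sequence; return the final predicted state."""
    for action in actions:
        obs = dynamics(obs, action)
    return obs


def cost(pred, goal):
    """Placeholder cost J: mean squared distance between index-matched points."""
    return sum((p - g) ** 2 for pp, gg in zip(pred, goal) for p, g in zip(pp, gg)) / len(goal)


def random_shooting_mpc(dynamics, obs0, goal, horizon=5, n_samples=256, max_step=0.05, seed=0):
    """Sample action sequences, roll each out through the model, keep the cheapest one."""
    rng = random.Random(seed)
    best_cost, best_seq = float("inf"), None
    for _ in range(n_samples):
        # Each action stacks both grippers' 3D displacements (assumed layout).
        seq = [[rng.uniform(-max_step, max_step) for _ in range(6)] for _ in range(horizon)]
        c = cost(rollout(dynamics, obs0, seq), goal)
        if c < best_cost:
            best_cost, best_seq = c, seq
    return best_seq, best_cost


def toy_dynamics(obs, action):
    """Hypothetical stand-in for the learned model: each of two points follows one gripper."""
    left, right = action[:3], action[3:]
    return [
        [x + d for x, d in zip(obs[0], left)],
        [x + d for x, d in zip(obs[1], right)],
    ]
```

In a receding-horizon loop, only the first action of the returned sequence would typically be executed before replanning from the next observation.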

Figure 4: Conceptual representation of the proposed framework for bimanual deformable fabric bag manipulation, based on the latent SOI dynamics model.
Figure 5: The proposed SOI global particle sampling process. Left: the reconstructed point cloud of the fabric bag. Right: (a) raw SOI point cloud; (b) preprocessed SOI point cloud; (c) reconstructed SOI surface; (d) resampled SOI particles.

IV Methodology

IV-A Global Particle Sampling from Raw Observation

In this work, we begin by sampling representative particles from raw visual inputs, i.e., RGB-D images, to facilitate the training of the Graph Neural Network (GNN)-based dynamics model, depicted in Fig. 4(a). The challenge lies in extracting meaningful particle representations of the Structure of Interest (SOI) from visual data that is significantly occluded. To this end, we introduce a global particle sampling methodology, as illustrated in Fig. 5.

Preprocessing: The initial phase of our methodology involves refining the raw point cloud data obtained from RealSense D435 cameras. With appropriate calibration, we convert the raw RGB-D images into point cloud data using the cameras’ intrinsic and extrinsic parameters. To enhance data quality, a statistical outlier removal filter is applied to strip away noise. Focusing on the bag’s opening rim, which is made of green fabric, we employ a color-based segmentation algorithm: a color threshold in the HSV color space is set to detect green hues, isolating the rim from the rest of the bag. Concurrently, using a separate HSV threshold, we identify and omit the points marking the bag’s handle areas, where the robots interact with the fabric. To ensure the rim’s point cloud is distinctly separated from other bag components, Euclidean clustering is performed as a concluding step, particularly if handle detection leaves any connected segments. This process yields a clean point set suited to representing the rim’s geometry.
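
As a rough illustration of two of the preprocessing steps above (statistical outlier removal followed by HSV thresholding), consider the sketch below. It is written in plain Python for self-containment. The HSV ranges, the neighbour count `k`, and the `std_ratio` are illustrative guesses rather than the values used in this work, and the handle removal and Euclidean clustering steps are omitted.

```python
import colorsys
import math


def is_green(rgb, h_range=(0.22, 0.45), s_min=0.3, v_min=0.2):
    """HSV threshold on an RGB triple in [0, 1]; the ranges are assumed, not the paper's."""
    h, s, v = colorsys.rgb_to_hsv(*rgb)
    return h_range[0] <= h <= h_range[1] and s >= s_min and v >= v_min


def inlier_indices(points, k=3, std_ratio=1.0):
    """Keep points whose mean distance to their k nearest neighbours is not unusually large."""
    mean_knn = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        mean_knn.append(sum(dists[:k]) / k)
    mu = sum(mean_knn) / len(mean_knn)
    sigma = math.sqrt(sum((d - mu) ** 2 for d in mean_knn) / len(mean_knn))
    return [i for i, d in enumerate(mean_knn) if d <= mu + std_ratio * sigma]


def extract_rim(points, colors):
    """Denoise the full cloud first, then keep only the green (rim) points."""
    keep = inlier_indices(points)
    return [points[i] for i in keep if is_green(colors[i])]
```

The brute-force neighbour search is quadratic in the number of points; a real pipeline would normally rely on a spatial index from a point cloud library.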

Surface Reconstruction: To reconstruct the green rim’s surface from its point cloud, we leverage Poisson reconstruction, chosen for its ability to handle occlusions by interpolating over incomplete data to form a smooth, continuous surface. Traditional methods like ball pivoting and alpha shapes underperform in occluded scenarios. Before reconstruction, we compute point normals via local principal axis analysis to guide the Poisson algorithm. The resulting surface is further refined with a uniform resampling algorithm, ensuring mesh uniformity and feature preservation for subsequent modeling and manipulation tasks.

Refinement with Topological Prior: In this step, we ensure that the reconstructed surface conforms to the anticipated topological structure of the bag’s opening rim through a two-fold approach: 1) Topological Analysis: We incorporate topological priors by recognizing that the rim should form a contiguous loop. This knowledge is employed to scrutinize the mesh, detecting any topological discrepancies. 2) Corrective Actions: Upon identifying any topological deviations, we undertake corrective measures by cutting and remeshing to realign the mesh with a loop topology that accurately represents the bag’s rim.
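
One simple way to test whether a triangulated rim closes on itself is the Euler characteristic $\chi = V - E + F$: a band that forms a closed loop has $\chi = 0$, whereas a strip that has been cut open (topologically a disk) has $\chi = 1$. The sketch below is only an assumed stand-in for the topological analysis described above, whose exact procedure is not specified here; `make_band` is a hypothetical helper that builds test meshes.

```python
def euler_characteristic(faces):
    """V - E + F for a triangle mesh given as triples of vertex indices."""
    vertices = {v for face in faces for v in face}
    edges = {frozenset(e) for a, b, c in faces for e in ((a, b), (b, c), (c, a))}
    return len(vertices) - len(edges) + len(faces)


def is_closed_band(faces):
    """True if the mesh has the Euler characteristic of a closed band (0).

    Note: a Möbius band also has characteristic 0, so this check alone
    does not verify orientability.
    """
    return euler_characteristic(faces) == 0


def make_band(n, closed=True):
    """Hypothetical test mesh: n quads in a row, each split into two triangles.

    If closed, the last quad wraps around to the first column of vertices.
    """
    cols = n if closed else n + 1
    faces = []
    for i in range(n):
        j = (i + 1) % cols
        b0, t0, b1, t1 = 2 * i, 2 * i + 1, 2 * j, 2 * j + 1
        faces += [(b0, b1, t0), (t0, b1, t1)]
    return faces
```

A mesh that fails the check would then be passed to the cutting and remeshing step.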

Resampling: In the final stage of our process, our objective is to generate a high-quality particle set that accurately represents the geometry of the rim. To achieve this, we initially implement voxel grid down-sampling to attain a uniform distribution of vertices while concurrently eliminating statistical outliers. Subsequently, we employ the farthest point sampling technique to selectively reduce the point cloud associated with the Structure of Interest (SOI) to a manageable quantity conducive to Graph Neural Network (GNN) training. The culmination of this process, resulting in a refined particle dataset, is depicted in Fig. 5(d).
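
The two resampling operations named above can be sketched as follows. This is a minimal plain-Python version under assumed parameters (`voxel_size`, `n_samples`, and the starting index are illustrative), not the implementation used in this work.

```python
import math
from collections import defaultdict


def voxel_downsample(points, voxel_size):
    """Replace all points that fall into the same voxel by their centroid."""
    buckets = defaultdict(list)
    for p in points:
        key = tuple(math.floor(c / voxel_size) for c in p)
        buckets[key].append(p)
    return [tuple(sum(c) / len(b) for c in zip(*b)) for b in buckets.values()]


def farthest_point_sampling(points, n_samples, start=0):
    """Greedily add the point farthest from the set selected so far."""
    chosen = [start]
    # min_dist[i] is the distance from point i to its nearest selected point.
    min_dist = [math.dist(p, points[start]) for p in points]
    while len(chosen) < min(n_samples, len(points)):
        nxt = max(range(len(points)), key=min_dist.__getitem__)
        chosen.append(nxt)
        min_dist = [min(d, math.dist(p, points[nxt])) for d, p in zip(min_dist, points)]
    return [points[i] for i in chosen]
```

Farthest point sampling spreads the retained particles evenly along the rim, which keeps the particle count fixed for GNN training regardless of how dense the reconstructed surface is.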

IV-B SOI Dynamics Model

To characterize the dynamics of SOI particles, as illustrated in Fig. 4(b), we first construct a particle graph and then introduce graph neural networks to model the SOI dynamics. We denote the graph representing SOI observations as $\mathcal{G}(\hat{\mathbf{o}}_{t})=(V_{t},E_{t})$, where the graph vertices $V_{t}$ correspond to the SOI’s particles $\mathbf{p}_{i,t}$. Each particle is expressed as $\mathbf{p}_{i,t}=\langle\mathbf{x}_{i,t},\mathbf{b}_{i,t}\rangle$, where the position and attributes of the particle $i$ at time $t$ are denoted as $\mathbf{x}_{i,t}$ and $\mathbf{b}_{i,t}$, respectively. Edges $E_{t}$ link vertices dynamically, based on spatial relationships, connecting all neighboring particles within a predefined range. Edge relations are captured by $\mathbf{e}_{j}=\langle m_{j},n_{j},\mathbf{c}_{j}\rangle$, with $1\leq m_{j},n_{j}\leq|\mathbf{P}_{t}|$ denoting the indices of the connected particles, $j$ being the edge index, and $\mathbf{c}_{j}$ describing the type of connection, be it internal structural or handle-to-rim linkages.
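
The radius-based graph construction described above can be sketched as follows. The `Edge` container, the boolean `is_handle` flags, and the `"rim"`/`"handle"` labels are hypothetical choices made for this example, and indices are 0-based here rather than 1-based as in the text.

```python
import math
from dataclasses import dataclass


@dataclass
class Edge:
    sender: int    # m_j (0-based in this sketch)
    receiver: int  # n_j (0-based in this sketch)
    kind: str      # c_j: "rim" for internal structure, "handle" for handle-to-rim


def build_particle_graph(positions, is_handle, radius):
    """Connect every ordered pair of distinct particles closer than `radius`."""
    edges = []
    for m, pm in enumerate(positions):
        for n, pn in enumerate(positions):
            if m != n and math.dist(pm, pn) <= radius:
                kind = "handle" if (is_handle[m] or is_handle[n]) else "rim"
                edges.append(Edge(m, n, kind))
    return edges
```

Because the edges depend only on current particle positions, the graph can be rebuilt at every time step as the rim deforms.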

The goal of introducing GNNs is to simulate the dynamics of the SOI and to predict subsequent states from a short historical sequence of SOI particle graphs, formalized as:

$$
\mathcal{G}(\hat{\mathbf{o}}_{t+1})=f_{\theta_{\text{dyn}}}\left(\mathcal{G}(\hat{\mathbf{o}}_{t-h:t}),\langle\mathbf{a}^{0},~\mathbf{a}^{1}\rangle_{t}\right)
\tag{2}
$$

To facilitate this, the original high-dimensional observation space concerning the SOI is reduced into a condensed, low-dimensional latent graph by encoding the distinct particle and connection features of the SOI. The loss function of each encoder is defined based on a distance metric as:

$$
\begin{aligned}
\phi^{\mathbf{p}},\psi^{\mathbf{p}} &= \arg\min_{\phi^{\mathbf{v}},\psi^{\mathbf{v}}}\operatorname{Dist}[V-(\phi^{\mathbf{p}}\circ\psi^{\mathbf{p}})(P)] \\
\phi^{\mathbf{e}},\psi^{\mathbf{e}} &= \arg\min_{\phi^{\mathbf{e}},\psi^{\mathbf{e}}}\operatorname{Dist}[E-(\phi^{\mathbf{e}}\circ\psi^{\mathbf{e}})(E)]
\end{aligned}
\tag{3}
$$

Here, $\phi^{\mathbf{p}}:V\rightarrow Z_{V}$ and $\phi^{\mathbf{e}}:E\rightarrow Z_{E}$ serve as the encoders for SOI particles and edges, while $\psi^{\mathbf{p}}:Z_{V}\rightarrow V$ and $\psi^{\mathbf{e}}:Z_{E}\rightarrow E$ function as their respective decoders. The encoding of SOI particle and edge features via $\phi^{\mathbf{p}}$ and $\phi^{\mathbf{e}}$ is executed as follows:

$$
\begin{aligned}
z_{i,t}^{\mathbf{p}} &= \phi^{\mathbf{p}}(\mathbf{p}_{i,t}) \\
z_{j,t}^{\mathbf{e}} &= \phi^{\mathbf{e}}(\mathbf{p}_{m_{j},t},\mathbf{p}_{n_{j},t},\mathbf{c}_{j})
\end{aligned}
\tag{4}
$$

Subsequently, the dynamics are captured using the decoders $\psi^{\mathbf{p}}$ and $\psi^{\mathbf{e}}$, leading to the prediction of the SOI particle graph at time $t+1$:

$$\begin{aligned}
\hat{\mathbf{e}}_{j,t} &= \psi^{\mathbf{e}}\left(z_{j,t}^{\mathbf{e}}\right), \quad j = 1, \cdots, |E_t| \\
\hat{\mathbf{p}}_{i,t+1} &= \psi^{\mathbf{p}}\Big(z_{i,t}^{\mathbf{p}}, \sum_{j \in \mathcal{N}_i} \hat{\mathbf{e}}_{j,t}\Big), \quad i = 1, \cdots, |\mathbf{P}_t|
\end{aligned} \tag{5}$$

Within this context, $\mathcal{N}_i$ denotes the set of edges where particle $i$ is the recipient. To adeptly handle the instantaneous propagation of forces, the training also integrates multistep message passing.
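The encode, decode, and aggregate steps of Eqs. (4) and (5) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the trained model: the random linear maps standing in for the MLP encoders and decoders, the latent width `H`, the scalar edge attribute used for $\mathbf{c}_j$, and all function names are hypothetical. Only a single message-passing step is shown.

```python
import random

def linear(n_in, n_out, rng):
    """Random linear map standing in for a learned MLP (assumption)."""
    w = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return lambda x: [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def predict_next(particles, edges, attrs, phi_p, phi_e, psi_e, psi_p, latent):
    """One pass of Eqs. (4)-(5); edges[j] = (m_j, n_j) = (sender, receiver)."""
    # Eq. (4): encode particles and edges into latent features.
    z_p = [phi_p(p) for p in particles]
    z_e = [phi_e(particles[m] + particles[n] + c)
           for (m, n), c in zip(edges, attrs)]
    # Eq. (5), first line: decode each edge latent into an edge effect.
    e_hat = [psi_e(z) for z in z_e]
    predicted = []
    for i in range(len(particles)):
        # Sum the effects of all edges whose receiver is particle i (the set N_i).
        agg = [0.0] * latent
        for (_, n), e in zip(edges, e_hat):
            if n == i:
                agg = [a + b for a, b in zip(agg, e)]
        # Eq. (5), second line: decode the next position of particle i.
        predicted.append(psi_p(z_p[i] + agg))
    return predicted

rng = random.Random(0)
H = 8  # hypothetical latent width
phi_p, phi_e = linear(3, H, rng), linear(3 + 3 + 1, H, rng)
psi_e, psi_p = linear(H, H, rng), linear(2 * H, 3, rng)
particles = [[0.0, 0.0, 0.0], [0.03, 0.0, 0.0], [0.06, 0.0, 0.0]]
edges = [(0, 1), (1, 0), (1, 2), (2, 1)]
attrs = [[1.0] for _ in edges]  # c_j: hypothetical scalar edge attribute
nxt = predict_next(particles, edges, attrs, phi_p, phi_e, psi_e, psi_p, H)
print(len(nxt), len(nxt[0]))
```

With untrained weights the predicted positions carry no physical meaning; the sketch only shows how per-edge effects are summed at the receiving particle before the particle decoder is applied.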

Our training data, derived from point clouds, lacks a consistent point-to-point mapping across frames, which precludes the use of particle-wise loss functions. To quantify the resemblance between two sets of SOI particle distributions, we investigate two distinct loss functions. The first is the widely adopted Chamfer distance (CD), computed between two particle sets $\mathbf{P}_1, \mathbf{P}_2 \subseteq \mathbb{R}^3$ as follows:

$$\mathcal{L}_{\text{CD}}(\mathbf{P}_1, \mathbf{P}_2) = \sum_{\mathbf{x}_1 \in \mathbf{P}_1} \min_{\mathbf{x}_2 \in \mathbf{P}_2} \|\mathbf{x}_1 - \mathbf{x}_2\|_2^2 + \sum_{\mathbf{x}_2 \in \mathbf{P}_2} \min_{\mathbf{x}_1 \in \mathbf{P}_1} \|\mathbf{x}_1 - \mathbf{x}_2\|_2^2 \tag{6}$$
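Eq. (6) translates directly into code. The following is a minimal pure-Python sketch for small particle sets; the function names and the example points are illustrative assumptions, and a practical implementation would likely use a vectorized or tree-based nearest-neighbour search.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def chamfer_distance(P1, P2):
    """Eq. (6): squared nearest-neighbour distances summed in both directions."""
    forward = sum(min(sq_dist(x1, x2) for x2 in P2) for x1 in P1)
    backward = sum(min(sq_dist(x1, x2) for x1 in P1) for x2 in P2)
    return forward + backward

# The two sets may differ in size, so no point-to-point mapping is needed.
P1 = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
P2 = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 2.0, 0.0)]
print(chamfer_distance(P1, P2))  # 4.0
```

In this example every point of `P1` has an exact match in `P2`, so the forward term is zero; the extra point of `P2` lies at squared distance 4 from its nearest neighbour in `P1`, which is the whole loss.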

The second is the Earth mover’s distance (EMD), which is formulated as:

$$\mathcal{L}_{\text{EMD}}(\mathbf{P}_1, \mathbf{P}_2) = \min_{\phi: \mathbf{P}_1 \rightarrow \mathbf{P}_2} \sum_{\mathbf{x}_1 \in \mathbf{P}_1} \|\mathbf{x}_1 - \phi(\mathbf{x}_1)\|_2 \tag{7}$$

where $\phi: \mathbf{P}_1 \rightarrow \mathbf{P}_2$ denotes a bijective mapping. Computing the EMD amounts to solving an assignment problem; for almost every pair of particle sets, the optimal bijection $\phi$ is unique and remains unchanged under infinitesimally small point displacements. In essence, EMD in our framework aligns distributions while mitigating point cloud anomalies through the definition of bijective correspondence. Note that the Chamfer distance is a distance only in a loose sense, as it does not satisfy the triangle inequality. Our composite loss function integrates these distances in a weighted fashion: $\mathcal{L}(\mathbf{P}_1, \mathbf{P}_2) = \alpha \mathcal{L}_{\text{CD}}(\mathbf{P}_1, \mathbf{P}_2) + \beta \mathcal{L}_{\text{EMD}}(\mathbf{P}_1, \mathbf{P}_2)$. Empirical evaluations suggest that the optimal weights are $\alpha = 0.85$ and $\beta = 0.15$.
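As a worked illustration of Eq. (7) and the weighted loss, the sketch below evaluates the EMD by brute force over all bijections and then forms $\alpha \mathcal{L}_{\text{CD}} + \beta \mathcal{L}_{\text{EMD}}$ with the weights quoted above. The brute-force search is an assumption made for clarity and is only feasible for a handful of points; an assignment solver such as the Hungarian algorithm would be the usual choice at realistic sizes. Function names are hypothetical.

```python
import itertools
import math

def emd(P1, P2):
    """Eq. (7): minimum total Euclidean distance over all bijections P1 -> P2."""
    if len(P1) != len(P2):
        raise ValueError("a bijection requires equally sized sets")
    return min(
        sum(math.dist(x1, P2[k]) for x1, k in zip(P1, perm))
        for perm in itertools.permutations(range(len(P2)))
    )

def chamfer(P1, P2):
    """Eq. (6), repeated here so the sketch is self-contained."""
    d2 = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    return (sum(min(d2(a, b) for b in P2) for a in P1)
            + sum(min(d2(a, b) for a in P1) for b in P2))

def hybrid_loss(P1, P2, alpha=0.85, beta=0.15):
    """Weighted combination of the Chamfer and Earth mover's distances."""
    return alpha * chamfer(P1, P2) + beta * emd(P1, P2)

P1 = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
P2 = [(1.0, 0.0, 1.0), (0.0, 0.0, 1.0)]  # P1 lifted by 1 and listed in reverse
print(emd(P1, P2))          # 2.0: each point is matched to the one directly above it
print(hybrid_loss(P1, P2))  # approximately 0.85 * 4 + 0.15 * 2 = 3.7
```

The example shows that the EMD does not depend on the order in which points are stored: the optimal bijection pairs each point with its lifted copy even though the lists are in opposite order.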

IV-C Model Predictive Control

Upon training our SOI-centric latent dynamics model, we integrate a model predictive control (MPC) approach to control the robotic gripper in manipulating the fabric bag, as depicted in Fig. 4(c). We simplify the gripper’s action space into a parameterized form $(x, y, z, r_z)$, with $\{x, y, z\}$ representing the position of the end-effector interacting with the bag handles and $r_z$ indicating the gripper’s rotation around the vertical $z$ axis. We omit the $r_x$ and $r_y$ rotations based on empirical evidence suggesting minimal SOI deformation from these movements. A goal-oriented MPC is formulated as follows, using $\mathbf{g}$ to denote the desired SOI shape and $\langle \mathbf{a}^0, \mathbf{a}^1 \rangle_{0:T-1}$ to represent the sequence of action pairs derived via MPC, with $T$ as the planning horizon.

$$\begin{aligned}
\min_{\langle \mathbf{a}^0, \mathbf{a}^1 \rangle_{0:T-1}} \quad & \mathcal{J}\left(\hat{\mathbf{o}}_T, \mathbf{g}\right) = \min_{\langle \mathbf{a}^0, \mathbf{a}^1 \rangle_{0:T-1}} \mathcal{L}\left(\hat{\mathbf{P}}_T, \mathbf{P}_{\mathbf{g}}\right) \\
\text{s.t.} \quad & \mathcal{G}(\hat{\mathbf{o}}_{t+1}) = f_{\theta_{\text{dym}}}\left(\mathcal{G}(\hat{\mathbf{o}}_{t-h:t}), \langle \mathbf{a}^0, \mathbf{a}^1 \rangle_t\right) \\
& \mathbf{P}_{\mathbf{g}} = \mathcal{G}(\mathbf{g}) \\
& \hat{\mathbf{P}}_t = \mathcal{G}(\hat{\mathbf{o}}_t) \\
& t = 0, 1, \ldots, T-1
\end{aligned} \tag{8}$$

The objective is to identify action sequences that minimize the hybrid distance to the goal, expressed as $\mathcal{J}(\hat{\mathbf{o}}_T, \mathbf{g}) \equiv \mathcal{L}(\hat{\mathbf{P}}_T, \mathbf{P}_{\mathbf{g}})$. Gradient-based trajectory optimization is used to find the minimal-cost trajectory, as shown in Fig. 4(c). Candidate action sequences are first generated by random shooting in the simplified action space, and their costs are computed by rolling them out through the GNN-based dynamics model. The lowest-cost trajectories are then refined with the limited-memory BFGS method [32], using the same loss function as in the training of the dynamics model.
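The two-stage planner described above (random shooting followed by gradient refinement) can be illustrated with a toy example. Everything specific in this sketch is an assumption made for illustration: the dynamics model is a hypothetical stand-in in which the SOI centroid follows the mean motion of the two grippers, each gripper action is reduced to a translation (the rotation $r_z$ is omitted), the cost is a squared centroid error rather than the hybrid loss, and the refinement uses plain finite-difference gradient descent in place of L-BFGS.

```python
import random

def toy_step(state, a0, a1):
    """Hypothetical dynamics: the centroid follows the mean gripper translation."""
    return [s + 0.5 * (u + v) for s, u, v in zip(state, a0, a1)]

def rollout(state, actions):
    """Apply a sequence of action pairs <a0, a1> over the planning horizon."""
    for a0, a1 in actions:
        state = toy_step(state, a0, a1)
    return state

def plan(state, goal, horizon=3, samples=200, refine_iters=50, lr=0.2, seed=0):
    rng = random.Random(seed)

    def cost(flat):
        # Each row holds both gripper translations: [a0 (3 values), a1 (3 values)].
        final = rollout(state, [(row[:3], row[3:]) for row in flat])
        return sum((f - g) ** 2 for f, g in zip(final, goal))

    def sample():
        return [[rng.uniform(-0.05, 0.05) for _ in range(6)] for _ in range(horizon)]

    # Stage 1: random shooting, keep the lowest-cost candidate.
    best = min((sample() for _ in range(samples)), key=cost)

    # Stage 2: gradient refinement (finite differences, stand-in for L-BFGS).
    eps = 1e-5
    for _ in range(refine_iters):
        base = cost(best)
        grad = []
        for t in range(horizon):
            row = []
            for k in range(6):
                best[t][k] += eps
                row.append((cost(best) - base) / eps)
                best[t][k] -= eps
            grad.append(row)
        best = [[a - lr * g for a, g in zip(r, gr)] for r, gr in zip(best, grad)]
    return best, cost(best)

actions, final_cost = plan([0.0, 0.0, 0.0], [0.1, 0.0, 0.05])
print(len(actions), final_cost < 1e-6)
```

In a real planner the rollout would call the learned dynamics model, the refinement step would respect the action bounds, and in closed-loop MPC only the first action pair would typically be executed before replanning.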

Figure 6: Experimental set-up. (a) Overview of the experimental set-up used to validate our SOI-based dynamics model for bimanual deformable bag manipulation tasks. (b)-(d) The SOI shape categories considered in the experiments: Long Oval (LO), Round Oval (RO), and Short Oval (SO).

TABLE I: Averaged SOI particle sampling results across 90 frames.

| Sampling Methods | CD (cm) ↓ | EMD (cm) ↓ | GD (cm) ↓ |
|---|---|---|---|
| Global Sampling (GPS) | $1.68 \pm 0.21$ | $1.27 \pm 0.35$ | $2.14 \pm 0.32$ |
| Local Sampling (LPS) | $\mathbf{1.46} \pm 0.11$ | $\mathbf{1.08} \pm 0.22$ | $\mathbf{1.97} \pm 0.23$ |

V Experiments

V-A Experiment Setup

Fig. 6 shows the general experimental setup used to validate our SOI-based dynamics model for bimanual deformable bag manipulation tasks. The setup includes four RealSense D435 RGB-D cameras, one positioned at each corner, which capture RGB-D images of the deformable fabric bag from multiple angles at a frequency of 30 Hz and a resolution of $640 \times 480$. Two Dobot CR5 robotic manipulators, equipped with 7 degrees of freedom, grasp the bag’s handles using 3D-printed grippers secured with zip ties. The Structure of Interest (SOI) of the bag, specifically the opening rim, has been modified for better SOI perception: it is cut and sewn with the same type of fabric but in a contrasting green color, which facilitates detection while maintaining the consistency of the bag’s dynamic behavior. Additionally, a top-down camera and a front-facing camera are employed to capture the experimental process from various perspectives.

The training data comprise 18,000 frames (around 30 minutes) collected over 200 episodes of 90 frames each. Within every episode, we execute six actions on the fabric bag. Before each episode begins, the fabric bag is arranged so that the opening rim protrudes outward without excessive wrinkling. The data collection policy randomly selects action parameters, namely translations along the $x$, $y$, and $z$ axes and the rotation $r_z$, after collision checking and bag constraint checking. Throughout each episode, we record the partial point clouds captured by the RGB-D cameras, as well as the end-effector poses returned by the robot controller. To reduce memory consumption, data are recorded only while the grippers are moving.
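One way to read the data collection policy is as rejection sampling over the parameterized action space. The sketch below is an illustration under stated assumptions rather than the procedure actually used: the sampling ranges, the `is_valid` callback standing in for the collision and bag constraint checks, and all names are hypothetical. The constants at the top restate the dataset size given above.

```python
import random

EPISODES = 200
FRAMES_PER_EPISODE = 90
ACTIONS_PER_EPISODE = 6
TOTAL_FRAMES = EPISODES * FRAMES_PER_EPISODE  # 18,000 frames, as reported

def sample_action(rng, is_valid, max_tries=100):
    """Draw random (x, y, z, rz) actions until one passes the validity check."""
    for _ in range(max_tries):
        action = {
            "x": rng.uniform(-0.05, 0.05),   # hypothetical translation range
            "y": rng.uniform(-0.05, 0.05),
            "z": rng.uniform(-0.05, 0.05),
            "rz": rng.uniform(-0.3, 0.3),    # hypothetical rotation range
        }
        if is_valid(action):
            return action
    raise RuntimeError("no valid action found")

def toy_check(action):
    """Stand-in for the collision and bag constraint checks (assumption)."""
    return abs(action["rz"]) < 0.2 and action["z"] > -0.04

rng = random.Random(0)
episode = [sample_action(rng, toy_check) for _ in range(ACTIONS_PER_EPISODE)]
print(TOTAL_FRAMES, len(episode))
```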

As illustrated in Fig. 6(b)-(d), this study considers three SOI shape categories: Long Oval (LO), Round Oval (RO), and Short Oval (SO). The experiment comprises two types of manipulation tasks. In the SOI shape preservation task, the robotic manipulators aim to keep the SOI shape consistent while moving the fabric bag from its initial position to a predefined target location. Both the initial and target configurations share the same shape but differ in their spatial transformations. Conversely, the SOI shape servoing task involves altering the SOI shape from one configuration to another distinct shape configuration by manipulating the bag handles. We conduct a quantitative evaluation of each component within our framework, encompassing SOI particle sampling, the SOI dynamics model, and performance results of designed manipulation tasks. Our primary metrics for evaluating the manipulation performance are Chamfer distance (CD) and Earth mover’s distance (EMD), along with a hybrid metric that combines both. Furthermore, we employ Geodesic distance (GD) [20] to assess the proximity of boundary points residing on one manifold with boundary to those on another.

TABLE II: The performance of the SOI particle dynamics model with different loss functions.

| Loss Functions | CD (cm) ↓ | EMD (cm) ↓ | GD (cm) ↓ |
|---|---|---|---|
| CD | $1.52 \pm 0.19$ | $1.94 \pm 0.18$ | $2.53 \pm 0.25$ |
| EMD | $1.85 \pm 0.17$ | $\mathbf{1.37} \pm 0.15$ | $2.76 \pm 0.13$ |
| $0.2$ CD + $0.8$ EMD | $1.59 \pm 0.16$ | $1.40 \pm 0.17$ | $2.13 \pm 0.18$ |
| $0.1$ CD + $0.9$ EMD | $1.55 \pm 0.17$ | $1.43 \pm 0.16$ | $2.28 \pm 0.19$ |
| $0.15$ CD + $0.85$ EMD | $\mathbf{1.50} \pm 0.16$ | $1.38 \pm 0.17$ | $\mathbf{2.02} \pm 0.16$ |

Figure 7: Qualitative results in (a) SOI shape preserving and (b) SOI shape servoing experiments.

V-B SOI Particle Sampling

We commence by benchmarking the proposed Global Particle Sampling (GPS) technique against the Local Particle Sampling (LPS) baseline. LPS initiates by processing the SOI-specific partial point cloud of the fabric bag through its unique color signature. Subsequently, it encapsulates the incomplete SOI-specific cloud with a convex hull. Following this, it performs the point sampling procedure, eventually amalgamating the sampled points into a full set of SOI-specific particles. As shown in Table I, we compute the mean distance between the sampled particles and the ground-truth particles, the latter captured by a professional 3D scanner. Our analysis indicates that the GPS method incurs lower losses in terms of all distance metrics, thus outperforming the LPS approach. These outcomes align with the premise that incorporating additional topological information about the SOI can markedly enhance the quality of sampling, particularly in scenarios where occlusions are present.

V-C GNN-based SOI Dynamics Model

The training of our GNN model, detailed in Section IV-B, begins with the construction of a graph in which edges are formed between vertices that lie within a distance threshold of $d = 0.04$. Every vertex and edge is encoded using 3-layer multilayer perceptrons (MLPs) whose hidden and output layers each have 300 neurons. The propagation module consists of a fully connected layer of size 300. For motion prediction, we employ an additional 3-layer MLP with a hidden layer size of 300. ReLU activation functions are used throughout the networks to introduce non-linearity. The model is trained for 120 epochs with the Adam optimizer. We use a batch size of 32 to balance generalization against computational efficiency, and a learning rate of $5 \times 10^{-4}$, a conventional choice for steady convergence. These hyperparameters were chosen to foster a robust learning process while maintaining the capacity to capture complex patterns within the data.
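The graph construction step can be sketched as a radius query over the SOI particles. The snippet below is a minimal illustration: the function name, the use of a strict inequality, directed edges in both directions, and the example coordinates are assumptions, and the unit of $d$ is taken to match that of the particle coordinates.

```python
import math

def build_edges(particles, d=0.04):
    """Directed edges (sender, receiver) between distinct particles closer than d."""
    edges = []
    for i, p in enumerate(particles):
        for j, q in enumerate(particles):
            if i != j and math.dist(p, q) < d:
                edges.append((i, j))
    return edges

particles = [(0.00, 0.0, 0.0), (0.03, 0.0, 0.0), (0.06, 0.0, 0.0), (0.20, 0.0, 0.0)]
print(build_edges(particles))
# [(0, 1), (1, 0), (1, 2), (2, 1)]
```

The last particle is farther than $d$ from every other particle and therefore receives no edges. The quadratic loop is adequate for a few hundred particles; a spatial index would be preferable for larger sets.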

Subsequently, we evaluate the GNN-based SOI dynamics model trained with different loss functions. Table II indicates that using CD or EMD alone optimizes its own metric but not the others, whereas combining the CD and EMD losses yields better overall performance, with a markedly lower Geodesic distance than either loss used individually. The combination $0.85~\text{CD} + 0.15~\text{EMD}$ achieves the best overall performance, giving the lowest CD, a competitive EMD, and the lowest GD, so we select this hybrid distance for the following experiments. The reason is that the CD and EMD losses capture different aspects of point cloud similarity: CD compares point-wise nearest-neighbour distances, while EMD measures global shape differences. By combining them, the model is optimized for both local point accuracy and global shape matching, and the $0.85~\text{CD} + 0.15~\text{EMD}$ weighting balances these objectives best.

TABLE III: Mean Hybrid Distance Error for SOI Shape Preserving by Different Dynamics Modeling Methods

| Method | Long Oval (LO) | Round Oval (RO) | Short Oval (SO) |
|---|---|---|---|
| MSM | $2.63 \pm 0.31$ | $2.82 \pm 0.38$ | $2.94 \pm 0.41$ |
| FEM | $1.96 \pm 0.27$ | $2.13 \pm 0.36$ | $2.42 \pm 0.38$ |
| Ours | $\mathbf{1.69} \pm 0.18$ | $\mathbf{1.84} \pm 0.22$ | $\mathbf{2.08} \pm 0.21$ |

TABLE IV: Mean Hybrid Distance Error and Success Rate for SOI Shape Servoing by Different Methods

| Method | Error: LO → SO | Error: SO → RO | Error: LO → RO | Success: LO → SO | Success: SO → RO | Success: LO → RO | Success: Total |
|---|---|---|---|---|---|---|---|
| VS [33] | $3.75 \pm 0.19$ | $3.69 \pm 0.22$ | $3.92 \pm 0.23$ | $21/30$ | $19/30$ | $22/30$ | $68.89\%$ |
| LSC [34] | $2.94 \pm 0.48$ | $3.14 \pm 0.42$ | $2.89 \pm 0.44$ | $24/30$ | $24/30$ | $26/30$ | $82.23\%$ |
| MSM [35] | $3.22 \pm 0.43$ | $3.49 \pm 0.47$ | $3.28 \pm 0.39$ | $19/30$ | $21/30$ | $23/30$ | $70.00\%$ |
| FEM [36] | $2.61 \pm 0.36$ | $2.82 \pm 0.39$ | $2.67 \pm 0.33$ | $23/30$ | $22/30$ | $26/30$ | $78.89\%$ |
| Ours | $2.25 \pm 0.23$ | $2.39 \pm 0.28$ | $2.13 \pm 0.24$ | $28/30$ | $30/30$ | $29/30$ | 96.67% |

V-D Manipulation Results

To evaluate the performance of the proposed SOI-based dynamics model for deformable fabric bag manipulation, we conducted SOI preserving and servoing experiments on a fabric bag with different oval shapes and the transitions between them. In the SOI preserving experiments, we compare our approach with two well-established dynamics modeling techniques: the Mass-Spring Model (MSM) [35] and the Finite Element Model (FEM) [36]. The comparison covers both qualitative and quantitative outcomes, shown in Fig. 7(a) and Table III. For all methods, the error rises as the bag’s shape changes from a Long Oval (LO) to a Short Oval (SO), suggesting that preserving the structure becomes harder as the opening rim becomes shorter and wider. Our GNN-based dynamics model outperforms MSM and FEM on every shape, achieving the lowest error in all shape preservation tasks. This outcome supports the greater expressive capability of our GNN-based model in capturing the intricate deformations of the fabric bag and in keeping the SOI shape unchanged while the bag is moved, particularly for shapes that exhibit more complex dynamic behaviors. Relative to FEM, for example, the gap in mean error grows from 0.27 for LO to 0.34 for SO (Table III).

In our SOI servoing experiments, we conduct a comprehensive analysis of the manipulation results associated with transitions between distinct shape categories of the structure of interest (SOI). Additionally, we compare our dynamics model with two shape servoing techniques tailored for deformable objects: visual servoing (VS) as described by Lagneau et al. [33], and the latent shape control (LSC) model introduced by Qi et al. [34]. The analysis integrates both qualitative and quantitative assessments, as depicted in Fig. 7(b) and detailed in Table IV. We consider three specific shape servoing tasks: LO (Long Oval) $\rightarrow$ SO (Short Oval), SO $\rightarrow$ RO (Round Oval), and LO $\rightarrow$ RO. The evaluation metrics are the Mean Hybrid Distance Error (recorded solely for successful trials) and the Success Rate of SOI servoing. A trial counts as successful when servoing completes with an error below $5$ cm, and success rates are based on 30 trials for each method and shape transition scenario. The experimental outcomes reveal that our proposed method consistently achieved the lowest shape error, ranging from $2.13$ to $2.39$ cm across all tested cases, and an overall success rate of $96.67\%$. Compared with traditional model-based techniques such as the Mass-Spring Model (MSM) and the Finite Element Model (FEM), our data-driven SOI dynamics model exhibited a markedly enhanced capability in manipulating the SOI of the fabric bag. It was observed that transformations involving shorter and wider shapes incurred higher errors due to their inherently complex dynamics. Nevertheless, our proposed method demonstrated robust performance across all tasks.
In conclusion, the bimanual manipulation experiments underscore the effectiveness of the proposed SOI-based dynamics modeling approach, which ensures precise and dependable control over the shapes of deformable objects through advanced predictive modeling and optimization techniques.
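The overall success rate reported in Table IV follows from the per-task counts and the 5 cm threshold. The sketch below shows that arithmetic; the function names and the per-trial error values in the example are hypothetical, while the counts 28, 30, and 29 are the per-task successes reported for our method.

```python
def count_successes(final_errors_cm, threshold_cm=5.0):
    """Number of trials whose final servoing error is below the threshold."""
    return sum(1 for e in final_errors_cm if e < threshold_cm)

def success_rate(successes_per_task, trials_per_task=30):
    """Overall success rate in percent across all tasks."""
    total_trials = trials_per_task * len(successes_per_task)
    return 100.0 * sum(successes_per_task) / total_trials

ours = [28, 30, 29]  # LO -> SO, SO -> RO, LO -> RO
print(f"{success_rate(ours):.2f}%")             # 96.67%
print(count_successes([2.1, 4.9, 5.0, 7.3]))    # 2 (hypothetical errors)
```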

VI Conclusion

In this work, we introduced a novel bimanual manipulation framework for deformable bags, centered around a Structure of Interest (SOI)-based latent dynamics model. Our approach effectively integrates multi-view perception, graph neural networks, and model predictive control, facilitating precise and efficient robotic manipulation of flexible materials. Through simulations and real-world experiments, the framework demonstrated promising results in intelligent physical interaction with deformable objects. One limitation of the current system is its reliance on differently colored SOIs to distinguish target manipulation areas, which may not be feasible in all operational settings. Future efforts will aim to enhance the system’s ability to identify and manipulate SOIs without such visual aids, broadening the applicability of our method to a wider array of real-world scenarios.

References

  • [1] H. Yin, A. Varava, and D. Kragic, “Modeling, learning, perception, and control methods for deformable object manipulation,” Sci. Robot., vol. 6, no. 54, 2021.
  • [2] J. Zhu, A. Cherubini, C. Dune, D. Navarro-Alarcon, F. Alambeigi, D. Berenson, F. Ficuciello, K. Harada, J. Kober, X. Li et al., “Challenges and outlook in robotic manipulation of deformable objects,” IEEE Robotics & Automation Magazine, vol. 29, no. 3, pp. 67–77, 2022.
  • [3] Z. Hu, T. Han, P. Sun, J. Pan, and D. Manocha, “3-d deformable object manipulation using deep neural networks,” IEEE Robotics and Automation Letters, vol. 4, no. 4, pp. 4255–4261, 2019.
  • [4] I. Garcia-Camacho, J. Borràs, B. Calli, A. Norton, and G. Alenyà, “Household cloth object set: Fostering benchmarking in deformable object manipulation,” IEEE Robotics and Automation Letters, vol. 7, no. 3, pp. 5866–5873, 2022.
  • [5] S. Huo, A. Duan, C. Li, P. Zhou, W. Ma, H. Wang, and D. Navarro-Alarcon, “Keypoint-based planar bimanual shaping of deformable linear objects under environmental constraints with hierarchical action framework,” IEEE Robotics and Automation Letters, vol. 7, no. 2, pp. 5222–5229, 2022.
  • [6] X. Lin, Z. Huang, Y. Li, J. B. Tenenbaum, D. Held, and C. Gan, “Diffskill: Skill abstraction from differentiable physics for deformable object manipulations with tools,” in International Conference on Learning Representations (ICLR), 2022.
  • [7] P. Zhou, J. Zhu, S. Huo, and D. Navarro-Alarcon, “LaSeSOM: A latent and semantic representation framework for soft object manipulation,” IEEE Robot. Autom. Lett., vol. 6, no. 3, pp. 5381–5388, 2021.
  • [8] X. Provot et al., “Deformation constraints in a mass-spring model to describe rigid cloth behaviour,” in Graphics interface.   Canadian Information Processing Society, 1995, pp. 147–147.
  • [9] K. Tabata, H. Seki, T. Tsuji, and T. Hiramitsu, “Mass spring model for non-uniformed deformable linear object toward dexterous manipulation,” Artificial Life and Robotics, vol. 28, no. 4, pp. 812–822, 2023.
  • [10] V. E. Arriola-Rios, P. Guler, F. Ficuciello, D. Kragic, B. Siciliano, and J. L. Wyatt, “Modeling of deformable objects for robotic manipulation: A tutorial and review,” Frontiers in Robotics and AI, vol. 7, p. 82, 2020.
  • [11] F. Ficuciello, A. Migliozzi, E. Coevoet, A. Petit, and C. Duriez, “Fem-based deformation control for dexterous manipulation of 3d soft objects,” in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).   IEEE, 2018, pp. 4007–4013.
  • [12] J. Sanchez, K. Mohy El Dine, J. A. Corrales, B.-C. Bouzgarrou, and Y. Mezouar, “Blind manipulation of deformable objects based on force sensing and finite element modeling,” Frontiers in Robotics and AI, vol. 7, p. 73, 2020.
  • [13] Z. Xu, C. Chi, B. Burchfiel, E. Cousineau, S. Feng, and S. Song, “Dextairity: Deformable manipulation can be a breeze,” arXiv preprint arXiv:2203.01197, 2022.
  • [14] P. Zhou, P. Zheng, J. Qi, C. Li, H.-Y. Lee, A. Duan, L. Lu, Z. Li, L. Hu, and D. Navarro-Alarcon, “Reactive human–robot collaborative manipulation of deformable linear objects using a new topological latent control model,” Robotics and Computer-Integrated Manufacturing, vol. 88, p. 102727, 2024.
  • [15] D. Navarro-Alarcon, H. M. Yip, Z. Wang, Y.-H. Liu, F. Zhong, T. Zhang, and P. Li, “Automatic 3-d manipulation of soft objects by robotic arms with an adaptive deformation model,” IEEE Transactions on Robotics, vol. 32, no. 2, pp. 429–441, 2016.
  • [16] J. Qi, G. Ma, J. Zhu, P. Zhou, Y. Lyu, H. Zhang, and D. Navarro-Alarcon, “Contour moments based manipulation of composite rigid-deformable objects with finite time model estimation and shape/position control,” IEEE/ASME Transactions on Mechatronics, vol. 27, no. 5, pp. 2985–2996, 2021.
  • [17] X. Lin, C. Qi, Y. Zhang, Z. Huang, K. Fragkiadaki, Y. Li, C. Gan, and D. Held, “Planning with spatial-temporal abstraction from point clouds for deformable object manipulation,” in Conference on Robot Learning (CoRL), 2022.
  • [18] A. Nair, D. Chen, P. Agrawal, P. Isola, P. Abbeel, J. Malik, and S. Levine, “Combining self-supervised learning and imitation for vision-based rope manipulation,” in 2017 IEEE international conference on robotics and automation (ICRA).   IEEE, 2017, pp. 2146–2153.
  • [19] M. Yan, Y. Zhu, N. Jin, and J. Bohg, “Self-supervised learning of state estimation for manipulating deformable linear objects,” IEEE Robotics and Automation Letters, vol. 5, no. 2, pp. 2372–2379, 2020.
  • [20] P. Reiser, M. Neubert, A. Eberhard, L. Torresi, C. Zhou, C. Shao, H. Metni, C. van Hoesel, H. Schopmans, T. Sommer et al., “Graph neural networks for materials science and chemistry,” Communications Materials, vol. 3, no. 1, p. 93, 2022.
  • [21] J. Shlomi, P. Battaglia, and J.-R. Vlimant, “Graph neural networks in particle physics,” Machine Learning: Science and Technology, vol. 2, no. 2, p. 021001, 2020.
  • [22] J. Gasteiger, F. Becker, and S. Günnemann, “Gemnet: Universal directional graph neural networks for molecules,” Advances in Neural Information Processing Systems, vol. 34, pp. 6790–6802, 2021.
  • [23] P. Zhou, J. Qi, A. Duan, S. Huo, Z. Wu, and D. Navarro-Alarcon, “Imitating tool-based garment folding from a single visual observation using hand-object graph dynamics,” IEEE Transactions on Industrial Informatics, 2024.
  • [24] T. Wang, R. Liao, J. Ba, and S. Fidler, “Nervenet: Learning structured policy with graph neural networks,” in International Conference on Learning Representations, 2018.
  • [25] E. Tolstaya, F. Gama, J. Paulos, G. Pappas, V. Kumar, and A. Ribeiro, “Learning decentralized controllers for robot swarms with graph neural networks,” in Conference on Robot Learning.   PMLR, 2020, pp. 671–682.
  • [26] Q. Li, F. Gama, A. Ribeiro, and A. Prorok, “Graph neural networks for decentralized multi-robot path planning,” in 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).   IEEE, 2020, pp. 11 785–11 792.
  • [27] E. Tolstaya, J. Paulos, V. Kumar, and A. Ribeiro, “Multi-robot coverage and exploration using spatial graph neural networks,” in 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).   IEEE, 2021, pp. 8944–8950.
  • [28] H. Shi, H. Xu, Z. Huang, Y. Li, and J. Wu, “Robocraft: Learning to see, simulate, and shape elasto-plastic objects with graph networks,” arXiv preprint arXiv:2205.02909, 2022.
  • [29] C. Wang, Y. Zhang, X. Zhang, Z. Wu, X. Zhu, S. Jin, T. Tang, and M. Tomizuka, “Offline-online learning of deformation model for cable manipulation with graph neural networks,” IEEE Robotics and Automation Letters, vol. 7, no. 2, pp. 5544–5551, 2022.
  • [30] Y. Rubanova, A. Sanchez-Gonzalez, T. Pfaff, and P. Battaglia, “Constraint-based graph network simulator,” arXiv preprint arXiv:2112.09161, 2021.
  • [31] H. Bertiche, M. Madadi, and S. Escalera, “Neural cloth simulation,” ACM Transactions on Graphics (TOG), vol. 41, no. 6, pp. 1–14, 2022.
  • [32] R. Fletcher, Practical methods of optimization.   John Wiley & Sons, 2000.
  • [33] R. Lagneau, A. Krupa, and M. Marchal, “Active deformation through visual servoing of soft objects,” in 2020 IEEE International Conference on Robotics and Automation (ICRA).   IEEE, 2020, pp. 8978–8984.
  • [34] J. Qi, G. Ma, P. Zhou, H. Zhang, Y. Lyu, and D. Navarro-Alarcon, “Towards latent space based manipulation of elastic rods using autoencoder models and robust centerline extractions,” Advanced Robotics, vol. 36, no. 3, pp. 101–115, 2022.
  • [35] F. Makiyeh, M. Marchal, F. Chaumette, and A. Krupa, “Indirect positioning of a 3d point on a soft object using rgb-d visual servoing and a mass-spring model,” in 2022 17th International Conference on Control, Automation, Robotics and Vision (ICARCV).   IEEE, 2022, pp. 235–242.
  • [36] Z. Zhang, T. M. Bieze, J. Dequidt, A. Kruszewski, and C. Duriez, “Visual servoing control of soft robots based on finite element model,” in 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).   IEEE, 2017, pp. 2895–2901.
Peng Zhou received the M.Sc. degree in software engineering from Tongji University, Shanghai, China, in 2017, and the Ph.D. degree in robotics from The Hong Kong Polytechnic University, Hong Kong SAR, in 2022. In 2021, he was a visiting Ph.D. student at the Robotics, Perception and Learning Lab, KTH Royal Institute of Technology, Stockholm, Sweden. He is currently a Research Officer at the Centre for Transformative Garment Production and a Postdoctoral Research Fellow at The University of Hong Kong. His research interests include deformable object manipulation, robot reasoning and learning, and task and motion planning.
Pai Zheng (Senior Member, IEEE) received dual bachelor’s degrees in mechanical engineering (major) and computer science and engineering (minor) from the Huazhong University of Science and Technology, Wuhan, China, in 2010, the master’s degree in mechanical engineering from Beihang University, Beijing, China, in 2013, and the Ph.D. degree in mechanical engineering from The University of Auckland, Auckland, New Zealand, in 2017. He was a Research Fellow with the Delta-NTU Corporate Laboratory for Cyber-Physical Systems, School of Electrical and Electronic Engineering, Nanyang Technological University (NTU), Singapore, from January 2018 to September 2019. He is currently an Assistant Professor with the Department of Industrial and Systems Engineering, The Hong Kong Polytechnic University. His research interests include smart product-service systems, human–robot collaboration, and smart manufacturing systems. He is a member of HKIE, CMES, and ASME. He serves as an Associate Editor for the Journal of Intelligent Manufacturing and the Journal of Cleaner Production, an Editorial Board Member for the Journal of Manufacturing Systems and Advanced Engineering Informatics, and a guest editor/reviewer for several high-impact international journals in the manufacturing and industrial engineering field.
Jiaming Qi received the Ph.D. degree in control science and engineering and the M.S. degree in integrated circuit engineering from the Harbin Institute of Technology, Harbin, China, in 2023 and 2018, respectively. His research interests include deformable object manipulation, visual servoing, and human–robot collaboration. He is currently a Postdoctoral Fellow with the Centre for Transformative Garment Production, The University of Hong Kong, Hong Kong.
Chengxi Li (Graduate Student Member, IEEE) received the B.E. degree in information technology from Vaasa University of Applied Sciences, Finland, in 2018, and the M.S. degree in computer science from Uppsala University, Sweden, in 2020. He is currently pursuing the Ph.D. degree with the Department of Industrial and Systems Engineering, The Hong Kong Polytechnic University, China. His research interests include robot learning, mixed reality, and human–robot collaboration.
Chenguang Yang (Fellow, IEEE) received the B.Eng. degree in measurement and control from Northwestern Polytechnical University, Xi’an, China, in 2005, the Ph.D. degree in control engineering from the National University of Singapore, Singapore, in 2010, and postdoctoral training in human robotics from Imperial College London, London, U.K., from 2009 to 2010. He was awarded a UK EPSRC UKRI Innovation Fellowship and an individual EU Marie Curie International Incoming Fellowship. As the lead author, he won the IEEE Transactions on Robotics Best Paper Award (2012) and the IEEE Transactions on Neural Networks and Learning Systems Outstanding Paper Award (2022). He is the leader of the Robot Teleoperation Group of Bristol Robotics Laboratory and the Corresponding Co-Chair of the Technical Committee on Collaborative Automation for Flexible Manufacturing (CAFM), IEEE Robotics and Automation Society. He is a Fellow of the Institution of Mechanical Engineers (IMechE), the Institution of Engineering and Technology (IET), the British Computer Society (BCS), and the Higher Education Academy (HEA). He has served as an Associate Editor of a number of leading international journals, including IEEE Transactions on Robotics. He is an elected Member-at-Large of the Board of Governors of the IEEE Systems, Man, and Cybernetics Society (SMC), and an AdCom member of the IEEE Industrial Electronics Society (IES), 2023–2025. His research interests lie in human–robot interaction and intelligent system design.
David Navarro-Alarcon (Senior Member, IEEE) received the Ph.D. degree in mechanical and automation engineering from The Chinese University of Hong Kong, Hong Kong, in 2014, where he also worked as a Postdoctoral Fellow and Research Assistant Professor. Since 2017, he has been with The Hong Kong Polytechnic University (PolyU), Hong Kong, where he is currently an Associate Professor with the Department of Mechanical Engineering. His research interests include perceptual robotics and control systems. He currently serves as an Associate Editor of the IEEE Transactions on Robotics and of the IEEE Robotics and Automation Magazine.
Jia Pan (Senior Member, IEEE) received the Ph.D. degree in computer science from the University of North Carolina at Chapel Hill, Chapel Hill, NC, USA, in 2013. He is currently an Associate Professor with the Department of Computer Science, University of Hong Kong, Hong Kong. He is also a member of the Centre for Garment Production Limited, Hong Kong. His research interests include robotics and artificial intelligence as applied to autonomous systems, particularly for navigation and manipulation in challenging tasks such as effective movement in dense human crowds and manipulating deformable objects for garment automation.