Bimanual Deformable Bag Manipulation Using a Structure-of-Interest Based Latent Dynamics Model

Peng Zhou$^{1}$, Pai Zheng$^{2}$, Jiaming Qi$^{1}$, Chenxi Li$^{2}$, Chenguang Yang$^{3}$, David Navarro-Alarcon$^{2}$, and Jia Pan$^{1}$

This work is supported by the Innovation and Technology Commission of the HKSAR Government under the InnoHK initiative. (Corresponding author: Jia Pan.)

$^{1}$ The authors are with The University of Hong Kong, HK, Hong Kong. {jeffzhou,tomqi,jpan}@hku.hk

$^{2}$ The authors are with The Hong Kong Polytechnic University, KLN, Hong Kong. {pai.zheng,dnavar}@polyu.edu.hk, [email protected]

$^{3}$ The author is with University of Liverpool, England, UK. [email protected]

Abstract

The manipulation of deformable objects by robotic systems presents a significant challenge due to their complex and infinite-dimensional configuration spaces. This paper introduces a novel approach to Deformable Object Manipulation (DOM) by emphasizing the identification and manipulation of Structures of Interest (SOIs) in deformable fabric bags. We propose a bimanual manipulation framework that leverages a Graph Neural Network (GNN)-based latent dynamics model to succinctly represent and predict the behavior of these SOIs. Our approach involves constructing a graph representation from partial point cloud data of the object and learning the latent dynamics model that effectively captures the essential deformations of the fabric bag within a reduced computational space. By integrating this latent dynamics model with Model Predictive Control (MPC), we empower robotic manipulators to perform precise and stable manipulation tasks focused on the SOIs. We have validated our framework through various empirical experiments demonstrating its efficacy in bimanual manipulation of fabric bags. Our contributions not only address the complexities inherent in DOM but also provide new perspectives and methodologies for enhancing robotic interactions with deformable objects by concentrating on their critical structural elements. Experimental videos can be obtained from https://sites.google.com/view/bagbot

Index Terms:

Deformable object manipulation, structure of interest, latent dynamics model, bimanual manipulation.

I Introduction

Deformable object manipulation (DOM) [1, 2, 3] is a fundamental capability for robots to meaningfully interact with the physical world and assist in various human tasks. However, the manipulation of deformable objects such as cloth [4], rope [5], and food ingredients [6] is particularly challenging due to their infinite-dimensional configuration space and complex dynamics. Traditional methods in DOM often resort to simplified physics models or data-driven modeling with handcrafted features, which adapt poorly to the varied shapes and dynamics of these objects [7]. Moreover, most current DOM works focus on manipulating the entire object and neglect the critical structures, i.e., Structures of Interest (SOIs), that are essential for subsequent manipulation steps.

Figure 1: (Left) SOI examples for different deformable object manipulation tasks, e.g., garment hanging, robot-assistive dressing. (Right) Conceptual representation of the manifold with boundary. The manifold encompasses $\operatorname{Int}\mathcal{M}$ and $\partial\mathcal{M}$, where the local neighborhoods of points in $\operatorname{Int}\mathcal{M}$ and $\partial\mathcal{M}$ are homeomorphically equivalent to $\operatorname{Int}\mathbb{H}^{n}$ and $\partial\mathbb{H}^{n}$.

In this paper, we introduce the concept of SOI into the realm of DOM (see Fig. 1 for an example), a paradigm shift that emphasizes the importance of identifying and manipulating key structural components rather than the entire object. This focus on the SOI is motivated by the observation that successful DOM tasks typically involve the manipulation of these key areas. By targeting these SOIs, we can reduce the computational load significantly, as modeling the complete 3D dynamics of the deformable object is unnecessary and burdensome for the task at hand.

Furthermore, this work is pioneering in considering the bimanual manipulation of a deformable fabric bag as shown in Fig. 2. By proposing a novel bimanual manipulation framework using a GNN-based latent dynamics model, we address the complexities of DOM with a focus on SOIs. Our approach is designed to extract the SOI from the observed object point cloud, construct a graph representation, and learn the latent dynamics model that effectively captures the object’s deformations within a compact space. Integrating this model with model predictive control (MPC), we enable robots to achieve accurate and stable manipulation of deformable bags.

The main contributions of this work include:

• Pioneering the bimanual manipulation of deformable fabric bags by focusing on SOIs.

• Introducing the SOI concept for representing deformable object states, which emphasizes the manipulation of critical structures.

• Designing a GNN-based method to learn a latent dynamics model from partial point cloud data, particularly focusing on the SOIs.

• Implementing MPC based on the latent dynamics for generating optimal manipulation actions centered around SOIs.

Various empirical experiments on the bimanual manipulation of a fabric bag validate the efficacy of our proposed framework. Our work offers new insights and methodologies for improving robotic systems’ capability for intelligent deformable object manipulation, with a specialized emphasis on the pivotal SOIs.

II Related Work

Deformable object manipulation has been an active research area in robotics. Existing methods can be categorized into model-based and data-driven approaches. Model-based methods rely on simplified physics models to represent deformable objects. Early works used mass-spring models (MSM) to simulate deformation [8, 9]. The finite element method (FEM) provides more accurate modeling of continuum mechanics [10, 11, 12]. However, analytical models require extensive manual tuning and generalization across different materials or shapes remains difficult. Data-driven methods [7, 13, 14] aim to learn models directly from data. Vision-based methods extract geometric features from visual observations to infer deformations [15, 16]. Recent works utilized deep learning on point cloud data and achieved improved modeling accuracy [17]. However, they depend heavily on large labeled datasets. Self-supervised methods were proposed to learn from physical interactions [18, 19]. However, they focused on planar objects and could not handle complex deformations.

Our work is most closely related to robot manipulation using graph neural networks (GNNs). GNNs have shown promising results in learning physics simulations [20, 21, 22, 23, 24] and control policies [25, 26, 27, 28]. Recently, GNNs were introduced to model rope and cloth manipulation [29, 30, 31]. Different from these works, we propose the structure of interest concept to focus on key object components and employ GNNs to learn latent dynamics models that describe complex deformable object behaviors. The integration of the latent dynamics model with MPC also distinguishes our framework from prior art. In summary, our work aims to advance existing deformable object manipulation by introducing an SOI-based modeling approach. The adoption of state-of-the-art deep learning techniques allows better generalizability across different materials and shapes. The experiments on the bimanual manipulation of fabric bags represent challenging test cases and validate the feasibility of our framework.

III Problem Statement

Given that individual image and depth observations generally do not fully disclose the state of the environment, we approach the task of bimanual bag manipulation as a Partially Observable Markov Decision Process (POMDP), as depicted in Fig. 3. This is formally defined by the tuple $(\mathcal{S},\mathcal{A},\mathcal{T},\mathcal{O},\mathcal{B},\mathcal{R},\gamma)$, where the state at time $t$, denoted by $s_{t}$, belongs to the state space $\mathcal{S}$ and is not directly observable. The state encapsulates the configuration of the robots and the manipulated object. The corresponding observation at time $t$, denoted by $o_{t}$, lies in the observation space $\mathcal{O}$. The state transition model $\mathcal{T}(s_{t+1}\mid s_{t},a_{t})$ describes the probability of transitioning from the current state $s_{t}$ to a new state $s_{t+1}$ upon taking an action $a_{t}$ from the action space $\mathcal{A}$, which consists of the combined left and right robotic actions and is represented by the Cartesian product $\mathcal{A}=\mathcal{A}^{1}\times\mathcal{A}^{2}$. The function $\mathcal{B}(o_{t}\mid s_{t},a_{t-1})$ specifies the likelihood of observing $o_{t}$ after executing action $a_{t-1}$ and transitioning to state $s_{t}$. The reward function $\mathcal{R}(s_{t},a_{t})$ assigns a real-valued reward to each state-action pair, and the discount factor $\gamma\in[0,1)$ quantifies the preference for immediate rewards over future rewards.

The goal of this work is to use two 3D-printed robot grippers to grasp the bag handles and manipulate the opening rim until it reaches a target state $\mathbf{g}$. We assume that the task is quasi-static; dynamic manipulation motions are not considered. At time step $t$, the robots apply the action $(\mathbf{a}^{1}_{t},\mathbf{a}^{2}_{t})\in\mathcal{A}$ to the bag, and we partially observe the transition of the bag from $\mathbf{o}_{t}$ to $\mathbf{o}_{t+1}$ under the unknown state transition from $\mathbf{s}_{t}$ to $\mathbf{s}_{t+1}$. A complete observation of the bag is not necessary, however. In our task, the opening rim is critical because it both determines the manipulation goal and provides the most informative sensory feedback, such as visual landmarks, during manipulation. We define a Structure of Interest (SOI) in the context of Deformable Object Manipulation (DOM) as a specific region or feature of a deformable object that is critical for the success of a manipulation task (see Fig. 1 for examples). In this task, we therefore take the opening rim of the manipulated bag as our SOI, and topologically we treat this loop-like structure as a manifold with boundary. As illustrated in Fig. 1, we denote its interior and boundary by $\operatorname{Int}\mathcal{M}$ and $\partial\mathcal{M}$, whose points have neighborhoods homeomorphic to $\operatorname{Int}\mathbb{H}^{n}=\{(x_{1},\ldots,x_{n})\mid x_{n}>0\}$ and $\partial\mathbb{H}^{n}=\{(x_{1},\ldots,x_{n})\mid x_{n}=0\}$, respectively.

With an appropriate perception module, the observation of the Structure of Interest (SOI), denoted by $\hat{o}_{t}$, can be extracted from the overall observation of the bag $o_{t}$. Our approach is based on the insight that it is more efficacious to predict the dynamics of the SOI rather than the entire complex dynamics of the bag. To this end, we employ a Graph Neural Network (GNN) to establish a dynamics model $f_{\theta_{\text{dym}}}$ that is dedicated to learning the transition functions of the SOI, defined as $f_{\theta_{\text{dym}}}:\hat{\mathcal{O}}\times\mathcal{A}\rightarrow\hat{\mathcal{O}}$. This dynamics model accepts as input a sequence of SOI observations $\mathbf{\hat{o}}_{t-n,\ldots,t}\in\mathcal{\hat{O}}$ and actions $(\mathbf{a}^{1}_{t-n,\ldots,t},\mathbf{a}^{2}_{t-n,\ldots,t})\in\mathcal{A}$, and predicts the subsequent observation $\mathbf{\hat{o}}_{t+1}$, where $n$ represents the length of the observation history before the current time step $t$. With the dynamics model, we proceed to cast the bimanual manipulation of the bag as a task within the Model Predictive Control (MPC) framework. Within this MPC setup, the cost function $\mathcal{J}$ quantifies the difference between the final SOI feature points at time step $T$ and the targeted SOI state $\mathbf{g}$. Details on the precise structure of the cost function $\mathcal{J}$ are given in the following section. This cost is minimized to yield an optimal sequence of actions across a temporal horizon of $T$ steps:

$$\begin{aligned}
\langle\mathbf{a}^{0},\mathbf{a}^{1}\rangle_{0:T-1} &= \underset{\langle\mathbf{a}^{0},\mathbf{a}^{1}\rangle_{0:T-1}\in\mathcal{A}}{\arg\min}\;\mathcal{J}(\mathbf{\hat{g}},\mathbf{g}) \\
\text{where}\quad\mathbf{\hat{g}} &= f_{\theta_{\text{dym}}}\left(\mathbf{\hat{o}}_{0},\langle\mathbf{a}^{0},\mathbf{a}^{1}\rangle_{0:T-1}\right)
\end{aligned}\tag{1}$$

where $\hat{\mathbf{g}}$ represents the predicted SOI state at time step $T$ . Fig. 4 shows the overall framework of our proposed latent dynamics model for the bimanual bag manipulation task.

IV Methodology

IV-A Global Particle Sampling from Raw Observation

In this work, we begin by sampling representative particles from raw visual inputs, i.e., RGB-D images, to facilitate the training of a Graph Neural Network (GNN)-based dynamics model, as depicted in Fig. 4(a). The challenge lies in extracting meaningful particle representations of the Structure of Interest (SOI) from visual data that is significantly occluded. To this end, we introduce a global particle sampling methodology, as illustrated in Fig. 5.

Preprocessing: The initial phase of our methodology refines the raw point cloud data obtained from the RealSense D435 cameras. After calibration, we convert the raw RGB-D images into point cloud data using the intrinsic and extrinsic parameters of the cameras. To enhance data quality, a statistical outlier removal filter is applied to strip away noise. Because the bag's opening rim is made of green fabric, we isolate it with a color-based segmentation algorithm: a color threshold in the HSV color space selects the green hues and separates the rim from the rest of the bag. Concurrently, we identify and omit the points in the handle areas, where the robots interact with the fabric, using a separate HSV threshold. As a concluding step, Euclidean clustering separates the rim's point cloud from other bag components, which matters in particular if handle detection leaves any connected segments. The result is a cleaned point set suited to representing the rim's geometry.
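The color-based segmentation step can be illustrated with a minimal sketch. It assumes the point cloud is available as parallel lists of 3D positions and RGB colors in $[0,1]$; the function name `segment_by_hue` and the numeric hue, saturation, and value bounds are hypothetical placeholders, not the thresholds used in the experiments.

```python
import colorsys

def segment_by_hue(points, colors, hue_range=(0.22, 0.45), min_sat=0.3, min_val=0.2):
    """Keep the points whose color falls inside an HSV band.

    points: list of (x, y, z); colors: list of (r, g, b) with components in [0, 1].
    The default band is an illustrative guess at "green" (assumption).
    """
    kept = []
    for p, (r, g, b) in zip(points, colors):
        h, s, v = colorsys.rgb_to_hsv(r, g, b)
        if hue_range[0] <= h <= hue_range[1] and s >= min_sat and v >= min_val:
            kept.append(p)
    return kept

# Three toy points: one green, one white, one red.
points = [(0.0, 0.0, 0.5), (0.1, 0.0, 0.5), (0.2, 0.0, 0.5)]
colors = [(0.1, 0.8, 0.2), (0.9, 0.9, 0.9), (0.8, 0.1, 0.1)]
rim_points = segment_by_hue(points, colors)
```

Only the green point survives in this toy input. A second call with a different band could play the role of the separate handle threshold; outlier removal and Euclidean clustering are left out of the sketch.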

Surface Reconstruction: To reconstruct the green rim's surface from its point cloud, we use Poisson reconstruction, which handles occlusions well because it interpolates over incomplete data to form a smooth, continuous surface. Alternatives such as ball pivoting and alpha shapes underperform in occluded scenarios. Before reconstruction, we compute point normals via local principal axis analysis to guide the Poisson algorithm. The resulting surface is then uniformly resampled, which keeps the mesh uniform and preserves features for the subsequent modeling and manipulation steps.

Refinement with Topological Prior: In this step, we ensure that the reconstructed surface conforms to the anticipated topological structure of the bag’s opening rim through a two-fold approach: 1) Topological Analysis: We incorporate topological priors by recognizing that the rim should form a contiguous loop. This knowledge is employed to scrutinize the mesh, detecting any topological discrepancies. 2) Corrective Actions: Upon identifying any topological deviations, we undertake corrective measures by cutting and remeshing to realign the mesh with a loop topology that accurately represents the bag’s rim.
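One way to make the topological analysis concrete is sketched below, under the assumption that the rim is reconstructed as a triangulated band. A closed band (an annulus) has Euler characteristic $V-E+F=0$ and two boundary loops, whereas a band that has been torn open is a disk with Euler characteristic 1 and a single boundary loop. The helper names and the synthetic `band_mesh` generator are hypothetical and serve only to exercise the check; the corrective cutting and remeshing step is not shown.

```python
from collections import defaultdict

def edge_face_counts(faces):
    """Map each undirected edge to the number of triangles that use it."""
    counts = defaultdict(int)
    for a, b, c in faces:
        for u, v in ((a, b), (b, c), (c, a)):
            counts[(min(u, v), max(u, v))] += 1
    return counts

def euler_characteristic(faces):
    """V - E + F of a triangle mesh given as vertex-index triples."""
    vertices = {v for f in faces for v in f}
    return len(vertices) - len(edge_face_counts(faces)) + len(faces)

def count_boundary_loops(faces):
    """Count connected components of the edges that belong to exactly one triangle."""
    neighbors = defaultdict(list)
    for (u, v), n in edge_face_counts(faces).items():
        if n == 1:
            neighbors[u].append(v)
            neighbors[v].append(u)
    seen, loops = set(), 0
    for start in neighbors:
        if start in seen:
            continue
        loops += 1
        stack = [start]
        while stack:
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                stack.extend(neighbors[node])
    return loops

def is_rim_like(faces):
    """True if the mesh has the topology of a closed band (annulus)."""
    return euler_characteristic(faces) == 0 and count_boundary_loops(faces) == 2

def band_mesh(n, closed=True):
    """Synthetic strip of n quads (n >= 3); vertex 2*i is bottom, 2*i+1 is top."""
    faces = []
    cols = n if closed else n + 1
    for i in range(n):
        j = (i + 1) % cols
        b0, t0, b1, t1 = 2 * i, 2 * i + 1, 2 * j, 2 * j + 1
        faces += [(b0, b1, t0), (t0, b1, t1)]
    return faces

closed_ok = is_rim_like(band_mesh(6))
open_ok = is_rim_like(band_mesh(6, closed=False))
```

Here `closed_ok` is true and `open_ok` is false, so a mesh that fails the check would be a candidate for the corrective actions described above.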

Resampling: In the final stage of our process, our objective is to generate a high-quality particle set that accurately represents the geometry of the rim. To achieve this, we initially implement voxel grid down-sampling to attain a uniform distribution of vertices while concurrently eliminating statistical outliers. Subsequently, we employ the farthest point sampling technique to selectively reduce the point cloud associated with the Structure of Interest (SOI) to a manageable quantity conducive to Graph Neural Network (GNN) training. The culmination of this process, resulting in a refined particle dataset, is depicted in Fig. 5(d).
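The two resampling operations can be sketched in a few lines of plain Python. This is a minimal illustration rather than the implementation used in the experiments: the function names, the voxel size, and the choice of the first point as the seed are assumptions, and a production pipeline would typically rely on a point cloud library.

```python
import math

def voxel_downsample(points, voxel):
    """Replace all points that fall into the same cubic voxel by their centroid."""
    buckets = {}
    for p in points:
        key = tuple(math.floor(c / voxel) for c in p)
        buckets.setdefault(key, []).append(p)
    return [tuple(sum(c) / len(b) for c in zip(*b)) for b in buckets.values()]

def farthest_point_sampling(points, k, start=0):
    """Greedily pick k points, each one farthest from the points chosen so far."""
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and len(points)")
    chosen = [start]
    # min_dist[i] is the distance from point i to its nearest chosen point.
    min_dist = [math.dist(p, points[start]) for p in points]
    while len(chosen) < k:
        nxt = max(range(len(points)), key=min_dist.__getitem__)
        chosen.append(nxt)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[nxt]))
    return [points[i] for i in chosen]

line = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (10.0, 0.0, 0.0), (5.0, 0.0, 0.0)]
sampled = farthest_point_sampling(line, k=3)
coarse = voxel_downsample([(0.1, 0.1, 0.1), (0.3, 0.1, 0.1), (1.5, 0.0, 0.0)], voxel=1.0)
```

On the toy line, the sampler starts at the origin, then picks the point at $x=10$, then the point at $x=5$, which shows how farthest point sampling spreads the retained particles along the structure.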

IV-B SOI Dynamics Model

To characterize the dynamics of SOI particles, as illustrated in Fig. 4(b), we first construct a particle graph and then use graph neural networks to model the SOI dynamics. We denote the graph representing the SOI observation as $\mathcal{G}(\mathbf{\hat{o}}_{t})=(V_{t},E_{t})$, where the graph vertices $V_{t}$ correspond to the SOI's particles $\mathbf{p}_{i,t}$. Each particle is expressed as $\mathbf{p}_{i,t}=\langle\mathbf{x}_{i,t},\mathbf{b}_{i,t}\rangle$, where $\mathbf{x}_{i,t}$ and $\mathbf{b}_{i,t}$ denote the position and attributes of particle $i$ at time $t$, respectively. Edges $E_{t}$ link vertices dynamically based on spatial relationships, connecting all neighboring particles within a predefined range. Edge relations are captured by $\mathbf{e}_{j}=\langle m_{j},n_{j},\mathbf{c}_{j}\rangle$, with $1\leq m_{j},n_{j}\leq|\mathbf{P}_{t}|$ denoting the indices of the connected particles, $j$ being the edge index, and $\mathbf{c}_{j}$ describing the type of connection, either an internal structural link or a handle-to-rim link.
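The radius-based edge construction can be sketched as follows. This is a simplified illustration: the function name is hypothetical, edges are returned as directed (sender, receiver) index pairs, and the connection type $\mathbf{c}_{j}$ and the particle attributes are omitted. The radius of 0.04 echoes the threshold $d$ reported in the experiments, whose unit is not stated there.

```python
import math

def build_particle_graph(positions, radius):
    """Return directed edges (sender, receiver) between particles closer than radius."""
    edges = []
    for m, pm in enumerate(positions):
        for n, pn in enumerate(positions):
            if m != n and math.dist(pm, pn) < radius:
                edges.append((m, n))
    return edges

# Four toy particles on a line; the last one is isolated.
positions = [(0.0, 0.0, 0.0), (0.03, 0.0, 0.0), (0.06, 0.0, 0.0), (0.5, 0.0, 0.0)]
edges = build_particle_graph(positions, radius=0.04)
```

Because the graph is rebuilt from the current particle positions, the edge set changes as the rim deforms, which is what linking vertices dynamically amounts to. The double loop is quadratic in the number of particles; a spatial index would be the usual replacement for larger sets.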

The goal of introducing GNNs is to simulate the dynamics of the SOI and to predict subsequent states from a short historical sequence of SOI particle graphs, formalized as:

$$\mathcal{G}(\mathbf{\hat{o}}_{t+1})=f_{\theta_{\text{dym}}}\left(\mathcal{G}(\mathbf{\hat{o}}_{t-h:t}),\langle\mathbf{a}^{0},\mathbf{a}^{1}\rangle_{t}\right)\tag{2}$$

To facilitate this, the original high-dimensional observation space concerning the SOI is reduced into a condensed, low-dimensional latent graph by encoding the distinct particle and connection features of the SOI. The loss function of each encoder is defined based on a distance metric as:

$$\begin{aligned}
\phi^{\mathbf{p}},\psi^{\mathbf{p}} &= \arg\min_{\phi^{\mathbf{p}},\psi^{\mathbf{p}}}\operatorname{Dist}\left[V-(\psi^{\mathbf{p}}\circ\phi^{\mathbf{p}})(V)\right] \\
\phi^{\mathbf{e}},\psi^{\mathbf{e}} &= \arg\min_{\phi^{\mathbf{e}},\psi^{\mathbf{e}}}\operatorname{Dist}\left[E-(\psi^{\mathbf{e}}\circ\phi^{\mathbf{e}})(E)\right]
\end{aligned}\tag{3}$$

Here, $\phi^{\mathbf{p}}:V\rightarrow Z_{V}$ and $\phi^{\mathbf{e}}:E\rightarrow Z_{E}$ serve as the encoders for SOI particles and edges, while $\psi^{\mathbf{p}}:Z_{V}\rightarrow V$ and $\psi^{\mathbf{e}}:Z_{E}\rightarrow E$ function as their respective decoders. The encoding of SOI particle and edge features via $\phi^{\mathbf{p}}$ and $\phi^{\mathbf{e}}$ is executed as follows:

$$\begin{aligned}
z_{i,t}^{\mathbf{p}} &= \phi^{\mathbf{p}}(\mathbf{p}_{i,t}) \\
z_{j,t}^{\mathbf{e}} &= \phi^{\mathbf{e}}(\mathbf{p}_{m_{j},t},\mathbf{p}_{n_{j},t},\mathbf{c}_{j})
\end{aligned}\tag{4}$$

Subsequently, the dynamics are captured using the decoders $\psi^{\mathbf{p}}$ and $\psi^{\mathbf{e}}$ , leading to the prediction of the SOI particle graph at time $t+1$ :

$$\begin{aligned}
\mathbf{\hat{e}}_{j,t} &= \psi^{\mathbf{e}}\left(z_{j,t}^{\mathbf{e}}\right),\quad j=1,\ldots,|E_{t}| \\
\hat{\mathbf{p}}_{i,t+1} &= \psi^{\mathbf{p}}\Big(z_{i,t}^{\mathbf{p}},\sum_{j\in\mathcal{N}_{i}}\mathbf{\hat{e}}_{j,t}\Big),\quad i=1,\ldots,|\mathbf{P}_{t}|
\end{aligned}\tag{5}$$

Within this context, $\mathcal{N}_{i}$ denotes the set of edges where particle $i$ is the recipient. To adeptly handle the instantaneous propagation of forces, the training also integrates multistep message passing.

Our training data, derived from point clouds, lacks a consistent point-to-point mapping across frames, which precludes the use of particle-wise loss functions. To quantify the resemblance between two sets of SOI particle distributions, we investigate two distinct loss functions. The first is the widely adopted Chamfer distance (CD), computed between two particle sets $\mathbf{P}_{1},\mathbf{P}_{2}\subseteq\mathbb{R}^{3}$ as follows:

$$\mathcal{L}_{\text{CD}}(\mathbf{P}_{1},\mathbf{P}_{2})=\sum_{\mathbf{x}_{1}\in\mathbf{P}_{1}}\min_{\mathbf{x}_{2}\in\mathbf{P}_{2}}\|\mathbf{x}_{1}-\mathbf{x}_{2}\|_{2}^{2}+\sum_{\mathbf{x}_{2}\in\mathbf{P}_{2}}\min_{\mathbf{x}_{1}\in\mathbf{P}_{1}}\|\mathbf{x}_{1}-\mathbf{x}_{2}\|_{2}^{2}\tag{6}$$

The second is the Earth mover’s distance (EMD), which is formulated as:

$$\mathcal{L}_{\text{EMD}}(\mathbf{P}_{1},\mathbf{P}_{2})=\min_{\phi:\mathbf{P}_{1}\rightarrow\mathbf{P}_{2}}\sum_{\mathbf{x}_{1}\in\mathbf{P}_{1}}\|\mathbf{x}_{1}-\phi(\mathbf{x}_{1})\|_{2}\tag{7}$$

where $\phi:\mathbf{P}_{1}\rightarrow\mathbf{P}_{2}$ denotes a bijective mapping. The EMD solves an assignment problem and, for almost every pair of particle sets, yields a unique and stable optimal bijection $\phi$ that is invariant to infinitesimally small point displacements. In our framework, EMD therefore aligns the two distributions, while the bijective correspondence mitigates point cloud anomalies. Note that the Chamfer distance is a distance only in a loose sense, since it does not satisfy the triangle inequality. Our composite loss function integrates these distances in a weighted fashion: $\mathcal{L}(\mathbf{P}_{1},\mathbf{P}_{2})=\alpha\mathcal{L}_{\text{CD}}(\mathbf{P}_{1},\mathbf{P}_{2})+\beta\mathcal{L}_{\text{EMD}}(\mathbf{P}_{1},\mathbf{P}_{2})$. Empirical evaluations suggest that the optimal weights are $\alpha=0.85$ and $\beta=0.15$.
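To make Eqs. (6) and (7) and the weighted combination concrete, the following sketch evaluates them on tiny particle sets. It is a didactic illustration under stated assumptions: the EMD is computed by brute force over all bijections, which is only feasible for a handful of points, whereas training would need an assignment solver or an approximation. The function names are hypothetical, and the default weights are the values of $\alpha$ and $\beta$ quoted in the text.

```python
import math
from itertools import permutations

def chamfer_distance(p1, p2):
    """Eq. (6): sum of squared nearest-neighbor distances in both directions."""
    def one_way(a, b):
        return sum(min(math.dist(x, y) ** 2 for y in b) for x in a)
    return one_way(p1, p2) + one_way(p2, p1)

def earth_movers_distance(p1, p2):
    """Eq. (7) by exhaustive search over bijections (tiny sets only)."""
    if len(p1) != len(p2):
        raise ValueError("a bijection needs equally sized sets")
    return min(
        sum(math.dist(x, y) for x, y in zip(p1, perm))
        for perm in permutations(p2)
    )

def hybrid_loss(p1, p2, alpha=0.85, beta=0.15):
    """Weighted sum of the two terms, with alpha on CD and beta on EMD."""
    return alpha * chamfer_distance(p1, p2) + beta * earth_movers_distance(p1, p2)

# Two particles, and the same two particles shifted by one unit along y.
a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
b = [(0.0, 1.0, 0.0), (1.0, 1.0, 0.0)]
cd, emd, loss = chamfer_distance(a, b), earth_movers_distance(a, b), hybrid_loss(a, b)
```

For this pair, every nearest neighbor is at distance 1, so the Chamfer term is $4$ (four squared distances of 1), the optimal bijection moves each particle straight up for an EMD of $2$, and the hybrid loss is $0.85\cdot 4+0.15\cdot 2=3.7$. Note that the two terms have different units (squared versus plain distance), which is one reason the weights need tuning.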

IV-C Model Predictive Control

Upon training our SOI-centric latent dynamics model, we integrate a model predictive control (MPC) approach to control the robotic gripper in manipulating the fabric bag, as depicted in Fig. 4(c). We simplify the gripper’s action space into a parameterized form: $(x,y,z,r_{z})$ , with $\{x,y,z\}$ representing the end-effector’s position interacting with the bag handles, and $r_{z}$ indicating the gripper’s rotation around the vertical $z$ axis. We omit $r_{x}$ and $r_{y}$ rotations based on empirical evidence suggesting minimal SOI deformation from these movements. A goal-oriented MPC is employed as below, using $\mathbf{g}$ to symbolize the desired SOI shape and $\langle\mathbf{a}^{0},~{}\mathbf{a}^{1}\rangle_{0:{T-1}}$ to represent the sequence of action pairs derived via MPC, with $T$ as the planning horizon.

$$\begin{aligned}
\min_{\langle\mathbf{a}^{0},\mathbf{a}^{1}\rangle_{0:T-1}}\quad & \mathcal{J}\left(\mathbf{\hat{o}}_{T},\mathbf{g}\right)=\min_{\langle\mathbf{a}^{0},\mathbf{a}^{1}\rangle_{0:T-1}}\mathcal{L}\left(\mathbf{\hat{P}}_{T},\mathbf{P}_{\mathbf{g}}\right) \\
\text{s.t.}\quad & \mathcal{G}(\mathbf{\hat{o}}_{t+1})=f_{\theta_{\text{dym}}}\left(\mathcal{G}(\mathbf{\hat{o}}_{t-h:t}),\langle\mathbf{a}^{0},\mathbf{a}^{1}\rangle_{t}\right) \\
& \mathbf{P}_{\mathbf{g}}=\mathcal{G}(\mathbf{g}) \\
& \mathbf{\hat{P}}_{t}=\mathcal{G}(\mathbf{\hat{o}}_{t}) \\
& t=0,1,\ldots,T-1
\end{aligned}\tag{8}$$

The objective is to identify action sequences that minimize the hybrid distance to the goal, expressed as $\mathcal{J}(\mathbf{\hat{o}}_{T},\mathbf{g})\equiv\mathcal{L}(\mathbf{\hat{P}}_{T},\mathbf{P}_{\mathbf{g}})$. Gradient-based trajectory optimization is used to find the minimum-cost trajectory, as shown in Fig. 4(c). We first run random shooting in the simplified action space and evaluate the cost of each sampled sequence with the GNN-based dynamics model. The lowest-cost trajectories are then refined with the limited-memory BFGS method [32], using the same loss function as in the training of the dynamics model.
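The random-shooting stage of this procedure can be sketched as below. Several simplifications are assumptions made for illustration: `toy_dynamics` is a hypothetical stand-in for the learned GNN model (it merely shifts every particle by the mean of the two gripper translations), actions are translations only without $r_{z}$, the cost uses the Chamfer term alone, and the L-BFGS refinement of the best candidates is omitted.

```python
import math
import random

def toy_dynamics(particles, action_pair):
    """Hypothetical stand-in for the learned model: shift all particles by the
    mean of the left and right gripper translations."""
    (ax, ay, az), (bx, by, bz) = action_pair
    dx, dy, dz = (ax + bx) / 2, (ay + by) / 2, (az + bz) / 2
    return [(x + dx, y + dy, z + dz) for x, y, z in particles]

def chamfer(p1, p2):
    """Symmetric sum of squared nearest-neighbor distances."""
    def one_way(a, b):
        return sum(min(math.dist(x, y) ** 2 for y in b) for x in a)
    return one_way(p1, p2) + one_way(p2, p1)

def rollout_cost(particles, actions, goal, dynamics):
    """Roll the model forward over the action sequence and score the final state."""
    for action_pair in actions:
        particles = dynamics(particles, action_pair)
    return chamfer(particles, goal)

def random_shooting(particles, goal, dynamics, horizon=3, samples=200, step=0.05, seed=0):
    """Sample action sequences uniformly and keep the one with the lowest cost."""
    rng = random.Random(seed)

    def sample_action():
        return tuple(rng.uniform(-step, step) for _ in range(3))

    best_actions, best_cost = None, float("inf")
    for _ in range(samples):
        actions = [(sample_action(), sample_action()) for _ in range(horizon)]
        cost = rollout_cost(particles, actions, goal, dynamics)
        if cost < best_cost:
            best_actions, best_cost = actions, cost
    return best_actions, best_cost

start = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0)]
goal = [(0.05, 0.0, 0.0), (0.15, 0.0, 0.0)]
best_actions, best_cost = random_shooting(start, goal, toy_dynamics)
initial_cost = chamfer(start, goal)
```

In a receding-horizon loop, only the first action pair of `best_actions` would be executed before re-planning from the next observation. Replacing `toy_dynamics` with a differentiable learned model is what makes the subsequent gradient-based refinement possible.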

TABLE I: Averaged SOI particle sampling results across 90 frames.

Sampling Methods	CD(cm) $\downarrow$	EMD(cm) $\downarrow$	GD(cm) $\downarrow$
Global Sampling (GPS)	$1.68\pm 0.21$	$1.27\pm 0.35$	$2.14\pm 0.32$
Local Sampling (LPS)	$\mathbf{1.46}\pm 0.11$	$\mathbf{1.08}\pm 0.22$	$\mathbf{1.97}\pm 0.23$

V Experiments

V-A Experiment Setup

Fig. 6 shows the experimental setup used to validate our SOI-based dynamics model for bimanual deformable bag manipulation tasks. The setup includes four RealSense D435 RGB-D cameras, one positioned at each corner, which capture RGB-D images of the deformable fabric bag from multiple angles at a frequency of 30 Hz and a resolution of 640 $\times$ 480. Two Dobot CR5 robotic manipulators, equipped with 7 degrees of freedom, grasp the bag's handles using 3D-printed grippers secured with zip ties. The Structure of Interest (SOI) of the bag, specifically the opening rim, has been modified for better SOI perception: it is cut and sewn with the same type of fabric but in a contrasting green color, which facilitates detection while keeping the bag's dynamic behavior consistent. Additionally, a top-down camera and a front-facing camera record the experimental process from different perspectives.

The training data consist of 18,000 frames (around 30 minutes) organized into 200 episodes of 90 frames each. Within every episode, we execute six actions on the fabric bag. Before each episode begins, the fabric bag is arranged so that the opening rim protrudes outward without excessive wrinkling. The data collection policy randomly selects the action parameters, namely translations along the $x$, $y$, and $z$ axes and the rotation $r_{z}$ about the $z$ axis, subject to collision checking and bag constraint checking. Throughout each episode, we record the partial point clouds captured by the RGB-D cameras and the end-effector poses returned by the robot controller. Frames are recorded only while the grippers move, which reduces memory consumption.

As illustrated in Fig. 6(b)-(d), this study considers three SOI shape categories: Long Oval (LO), Round Oval (RO), and Short Oval (SO). The experiment comprises two types of manipulation tasks. In the SOI shape preservation task, the robotic manipulators aim to keep the SOI shape consistent while moving the fabric bag from its initial position to a predefined target location. Both the initial and target configurations share the same shape but differ in their spatial transformations. Conversely, the SOI shape servoing task involves altering the SOI shape from one configuration to another distinct shape configuration by manipulating the bag handles. We conduct a quantitative evaluation of each component within our framework, encompassing SOI particle sampling, the SOI dynamics model, and performance results of designed manipulation tasks. Our primary metrics for evaluating the manipulation performance are Chamfer distance (CD) and Earth mover’s distance (EMD), along with a hybrid metric that combines both. Furthermore, we employ Geodesic distance (GD) [20] to assess the proximity of boundary points residing on one manifold with boundary to those on another.

TABLE II: The performance of the SOI particle dynamics model with different loss functions.

Loss Functions	CD(cm) $\downarrow$	EMD(cm) $\downarrow$	GD(cm) $\downarrow$
CD	$1.52\pm 0.19$	$1.94\pm 0.18$	$2.53\pm 0.25$
EMD	$1.85\pm 0.17$	$\mathbf{1.37}\pm 0.15$	$2.76\pm 0.13$
$0.2$ CD + $0.8$ EMD	$1.59\pm 0.16$	$1.40\pm 0.17$	$2.13\pm 0.18$
$0.1$ CD + $0.9$ EMD	$1.55\pm 0.17$	$1.43\pm 0.16$	$2.28\pm 0.19$
$0.15$ CD + $0.85$ EMD	$\mathbf{1.50}\pm 0.16$	$1.38\pm 0.17$	$\mathbf{2.02}\pm 0.16$

V-B SOI Particle Sampling

We commence by benchmarking the proposed Global Particle Sampling (GPS) technique against the Local Particle Sampling (LPS) baseline. LPS initiates by processing the SOI-specific partial point cloud of the fabric bag through its unique color signature. Subsequently, it encapsulates the incomplete SOI-specific cloud with a convex hull. Following this, it performs the point sampling procedure, eventually amalgamating the sampled points into a full set of SOI-specific particles. As shown in Table I, we compute the mean distance between the sampled particles and the ground-truth particles, the latter captured by a professional 3D scanner. Our analysis indicates that the GPS method incurs lower losses in terms of all distance metrics, thus outperforming the LPS approach. These outcomes align with the premise that incorporating additional topological information about the SOI can markedly enhance the quality of sampling, particularly in scenarios where occlusions are present.

V-C GNN-based SOI Dynamics Model

The training of our GNN model, detailed in Section IV-B, begins with the construction of a graph in which edges are formed between vertices that lie within a distance threshold of $d=0.04$. Every vertex and edge is encoded using a 3-layer Multilayer Perceptron (MLP) whose hidden and output layers each have 300 neurons. The propagation module consists of a fully connected layer of size 300. For motion prediction, we employ an additional 3-layer MLP with a hidden layer of 300 neurons. ReLU activation functions are used throughout the networks. The model is trained for 120 epochs with the Adam optimizer, a batch size of 32, and a learning rate of $5\times 10^{-4}$. The batch size was selected to balance generalization against computational efficiency, and the learning rate is a conventional choice for steady convergence.

Subsequently, we evaluate the performance of GNN-based SOI dynamics models trained with different loss functions. Table II indicates that using CD or EMD alone optimizes that loss's own metric but not the others, whereas combining the CD and EMD losses leads to better overall performance, with the Geodesic distance improving markedly compared to either loss used individually. The combination of $0.85~\text{CD}+0.15~\text{EMD}$ achieves the best overall performance, giving the lowest CD, the lowest GD, and a competitive EMD, so we select this hybrid distance for the following experiments. The reason is that the CD and EMD losses capture different aspects of point cloud similarity: CD compares point-wise nearest-neighbor distances, while EMD measures global shape differences. By combining them, the model is optimized for both local point accuracy and global shape matching, and the 0.85 CD + 0.15 EMD weighting balances these objectives best.

TABLE III: Mean Hybrid Distance Error for SOI Shape Preserving by Different Dynamics Modeling Methods

Method	Long Oval (LO)	Round Oval (RO)	Short Oval (SO)
MSM	$2.63\pm 0.31$	$2.82\pm 0.38$	$2.94\pm 0.41$
FEM	$1.96\pm 0.27$	$2.13\pm 0.36$	$2.42\pm 0.38$
Ours	$\textbf{1.69}\pm 0.18$	$\textbf{1.84}\pm 0.22$	$\textbf{2.08}\pm 0.21$

TABLE IV: Mean Hybrid Distance Error and Success Rate for SOI Shape Servoing by Different Methods

Method	LO $\rightarrow$ SO	SO $\rightarrow$ RO	LO $\rightarrow$ RO	LO $\rightarrow$ SO	SO $\rightarrow$ RO	LO $\rightarrow$ RO	Total
Method	Mean Hybrid Distance Error			Success Rate
VS [33]	$3.75\pm 0.19$	$3.69\pm 0.22$	$3.92\pm 0.23$	$21/30$	$19/30$	$22/30$	$68.89\%$
LSC [34]	$2.94\pm 0.48$	$3.14\pm 0.42$	$2.89\pm 0.44$	$24/30$	$24/30$	$26/30$	$82.23\%$
MSM [35]	$3.22\pm 0.43$	$3.49\pm 0.47$	$3.28\pm 0.39$	$19/30$	$21/30$	$23/30$	$70.00\%$
FEM [36]	$2.61\pm 0.36$	$2.82\pm 0.39$	$2.67\pm 0.33$	$23/30$	$22/30$	$26/30$	$78.89\%$
Ours	$2.25\pm 0.23$	$2.39\pm 0.28$	$2.13\pm 0.24$	$28/30$	$30/30$	$29/30$	$96.67\%$

V-D Manipulation Results

To evaluate the proposed SOI-based dynamics model for deformable fabric bag manipulation, we conducted SOI preserving and SOI servoing experiments on a fabric bag with different oval opening shapes and the transitions between them. In the SOI preserving experiments, we compare our approach with two well-established dynamics modeling techniques: the Mass-Spring Model (MSM) [35] and the Finite Element Model (FEM) [36]. The quantitative and qualitative results are given in Fig. 7(a) and Table III. For all methods, the error rises as the bag opening changes from a Long Oval (LO) to a Short Oval (SO), suggesting that preserving the rim structure becomes harder as the opening becomes shorter and wider. Our GNN-based dynamics model achieves the lowest error for every shape. This supports the expressive capability of the GNN-based model in capturing the intricate deformations of the fabric bag and in keeping the SOI shape unchanged while the bag is moved. Relative to FEM, the stronger of the two baselines, the margin in mean error grows from 0.27 on LO to 0.34 on SO, so the advantage is most pronounced for the shapes that exhibit more complex dynamic behavior.
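The margins between methods can be recomputed directly from the mean values in Table III. The short sketch below does only that arithmetic; it ignores the reported standard deviations, and the dictionary layout and the helper name `margins` are illustrative choices rather than anything from the paper.

```python
# Mean hybrid distance errors copied from Table III (standard deviations omitted).
table_iii = {
    "MSM":  {"LO": 2.63, "RO": 2.82, "SO": 2.94},
    "FEM":  {"LO": 1.96, "RO": 2.13, "SO": 2.42},
    "Ours": {"LO": 1.69, "RO": 1.84, "SO": 2.08},
}

def margins(baseline: str) -> dict:
    """Baseline mean error minus our mean error, per shape (larger = bigger gain)."""
    return {
        shape: round(table_iii[baseline][shape] - table_iii["Ours"][shape], 2)
        for shape in ("LO", "RO", "SO")
    }

for baseline in ("MSM", "FEM"):
    print(baseline, margins(baseline))
```

Against FEM the margin increases monotonically from LO to SO, while against MSM it is largest on RO, so statements about where the gain is greatest depend on which baseline is taken as the reference.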

In the SOI servoing experiments, we analyze the manipulation results for transitions between distinct shape categories of the structure of interest (SOI). In addition to MSM and FEM, we compare our dynamics model with two shape servoing techniques designed for deformable objects: visual servoing (VS) by Lagneau et al. [33] and the latent shape control (LSC) model by Qi et al. [34]. The quantitative and qualitative results are given in Fig. 7(b) and Table IV. We consider three shape servoing tasks: LO (Long Oval) $\rightarrow$ SO (Short Oval), SO $\rightarrow$ RO (Round Oval), and LO $\rightarrow$ RO. The evaluation metrics are the Mean Hybrid Distance Error, recorded only for successful trials, and the Success Rate, where a trial counts as successful if servoing completes with an error below $5$ cm. Each method is run for 30 trials per shape transition. Our proposed method obtains the lowest shape error in every case, ranging from $2.13$ to $2.39$ cm, and the highest overall success rate of $96.67\%$. Compared with the model-based MSM and FEM, our data-driven SOI dynamics model shows a markedly better capability in manipulating the SOI of the fabric bag. Transformations involving shorter and wider shapes incurred higher errors due to their more complex dynamics. Nevertheless, our proposed method performed robustly across all tasks.

In conclusion, the bimanual manipulation experiments demonstrate the effectiveness of the proposed SOI-based dynamics modeling approach, which provides precise and dependable control over the shapes of deformable objects through predictive modeling and optimization.
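The evaluation protocol for the servoing experiments can be sketched as below. The 5 cm threshold, the 30 trials per task, and the per-task success counts come from the text and Table IV; the per-trial error values, the function names, and the assumption that each trial is summarized by a single final error are hypothetical and serve only to illustrate the bookkeeping.

```python
SUCCESS_THRESHOLD_CM = 5.0  # a trial succeeds if its final error is below this

def summarize_trials(errors_cm):
    """Return (num_successes, success_rate, mean error over successful trials)."""
    successes = [e for e in errors_cm if e < SUCCESS_THRESHOLD_CM]
    rate = len(successes) / len(errors_cm)
    # The mean error is reported for successful trials only, as in the text.
    mean_error = sum(successes) / len(successes) if successes else float("nan")
    return len(successes), rate, mean_error

def total_success_rate(success_counts, trials_per_task=30):
    """Overall success rate in percent, pooled across tasks."""
    return 100.0 * sum(success_counts) / (trials_per_task * len(success_counts))

# Hypothetical per-trial final errors (cm) for one task; one trial fails.
demo_errors = [2.1, 2.4, 6.3, 1.9]
print(summarize_trials(demo_errors))

# Per-task success counts of our method from Table IV: 28/30, 30/30, 29/30.
print(round(total_success_rate([28, 30, 29]), 2))  # 96.67
```

Pooling the counts in this way gives $87/90$ for our method, which matches the $96.67\%$ total reported in Table IV.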

VI Conclusion

In this work, we introduced a novel bimanual manipulation framework for deformable bags, centered around a Structure of Interest (SOI)-based latent dynamics model. Our approach effectively integrates multi-view perception, graph neural networks, and model predictive control, facilitating precise and efficient robotic manipulation of flexible materials. Through simulations and real-world experiments, the framework demonstrated promising results in intelligent physical interaction with deformable objects. One limitation of the current system is its reliance on differently colored SOIs to distinguish target manipulation areas, which may not be feasible in all operational settings. Future efforts will aim to enhance the system’s ability to identify and manipulate SOIs without such visual aids, broadening the applicability of our method to a wider array of real-world scenarios.

References

[1] H. Yin, A. Varava, and D. Kragic, “Modeling, learning, perception, and control methods for deformable object manipulation,” Sci. Robot., vol. 6, no. 54, 2021.
[2] J. Zhu, A. Cherubini, C. Dune, D. Navarro-Alarcon, F. Alambeigi, D. Berenson, F. Ficuciello, K. Harada, J. Kober, X. Li et al., “Challenges and outlook in robotic manipulation of deformable objects,” IEEE Robotics & Automation Magazine, vol. 29, no. 3, pp. 67–77, 2022.
[3] Z. Hu, T. Han, P. Sun, J. Pan, and D. Manocha, “3-d deformable object manipulation using deep neural networks,” IEEE Robotics and Automation Letters, vol. 4, no. 4, pp. 4255–4261, 2019.
[4] I. Garcia-Camacho, J. Borràs, B. Calli, A. Norton, and G. Alenyà, “Household cloth object set: Fostering benchmarking in deformable object manipulation,” IEEE Robotics and Automation Letters, vol. 7, no. 3, pp. 5866–5873, 2022.
[5] S. Huo, A. Duan, C. Li, P. Zhou, W. Ma, H. Wang, and D. Navarro-Alarcon, “Keypoint-based planar bimanual shaping of deformable linear objects under environmental constraints with hierarchical action framework,” IEEE Robotics and Automation Letters, vol. 7, no. 2, pp. 5222–5229, 2022.
[6] X. Lin, Z. Huang, Y. Li, J. B. Tenenbaum, D. Held, and C. Gan, “Diffskill: Skill abstraction from differentiable physics for deformable object manipulations with tools,” in International Conference on Learning Representations (ICLR), 2022.
[7] P. Zhou, J. Zhu, S. Huo, and D. Navarro-Alarcon, “LaSeSOM: A latent and semantic representation framework for soft object manipulation,” IEEE Robot. Autom. Lett., vol. 6, no. 3, pp. 5381–5388, 2021.
[8] X. Provot et al., “Deformation constraints in a mass-spring model to describe rigid cloth behaviour,” in Graphics interface. Canadian Information Processing Society, 1995, pp. 147–154.
[9] K. Tabata, H. Seki, T. Tsuji, and T. Hiramitsu, “Mass spring model for non-uniformed deformable linear object toward dexterous manipulation,” Artificial Life and Robotics, vol. 28, no. 4, pp. 812–822, 2023.
[10] V. E. Arriola-Rios, P. Guler, F. Ficuciello, D. Kragic, B. Siciliano, and J. L. Wyatt, “Modeling of deformable objects for robotic manipulation: A tutorial and review,” Frontiers in Robotics and AI, vol. 7, p. 82, 2020.
[11] F. Ficuciello, A. Migliozzi, E. Coevoet, A. Petit, and C. Duriez, “Fem-based deformation control for dexterous manipulation of 3d soft objects,” in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2018, pp. 4007–4013.
[12] J. Sanchez, K. Mohy El Dine, J. A. Corrales, B.-C. Bouzgarrou, and Y. Mezouar, “Blind manipulation of deformable objects based on force sensing and finite element modeling,” Frontiers in Robotics and AI, vol. 7, p. 73, 2020.
[13] Z. Xu, C. Chi, B. Burchfiel, E. Cousineau, S. Feng, and S. Song, “Dextairity: Deformable manipulation can be a breeze,” arXiv preprint arXiv:2203.01197, 2022.
[14] P. Zhou, P. Zheng, J. Qi, C. Li, H.-Y. Lee, A. Duan, L. Lu, Z. Li, L. Hu, and D. Navarro-Alarcon, “Reactive human–robot collaborative manipulation of deformable linear objects using a new topological latent control model,” Robotics and Computer-Integrated Manufacturing, vol. 88, p. 102727, 2024.
[15] D. Navarro-Alarcon, H. M. Yip, Z. Wang, Y.-H. Liu, F. Zhong, T. Zhang, and P. Li, “Automatic 3-d manipulation of soft objects by robotic arms with an adaptive deformation model,” IEEE Transactions on Robotics, vol. 32, no. 2, pp. 429–441, 2016.
[16] J. Qi, G. Ma, J. Zhu, P. Zhou, Y. Lyu, H. Zhang, and D. Navarro-Alarcon, “Contour moments based manipulation of composite rigid-deformable objects with finite time model estimation and shape/position control,” IEEE/ASME Transactions on Mechatronics, vol. 27, no. 5, pp. 2985–2996, 2021.
[17] X. Lin, C. Qi, Y. Zhang, Z. Huang, K. Fragkiadaki, Y. Li, C. Gan, and D. Held, “Planning with spatial-temporal abstraction from point clouds for deformable object manipulation,” in Conference on Robot Learning (CoRL), 2022.
[18] A. Nair, D. Chen, P. Agrawal, P. Isola, P. Abbeel, J. Malik, and S. Levine, “Combining self-supervised learning and imitation for vision-based rope manipulation,” in 2017 IEEE international conference on robotics and automation (ICRA). IEEE, 2017, pp. 2146–2153.
[19] M. Yan, Y. Zhu, N. Jin, and J. Bohg, “Self-supervised learning of state estimation for manipulating deformable linear objects,” IEEE Robotics and Automation Letters, vol. 5, no. 2, pp. 2372–2379, 2020.
[20] P. Reiser, M. Neubert, A. Eberhard, L. Torresi, C. Zhou, C. Shao, H. Metni, C. van Hoesel, H. Schopmans, T. Sommer et al., “Graph neural networks for materials science and chemistry,” Communications Materials, vol. 3, no. 1, p. 93, 2022.
[21] J. Shlomi, P. Battaglia, and J.-R. Vlimant, “Graph neural networks in particle physics,” Machine Learning: Science and Technology, vol. 2, no. 2, p. 021001, 2020.
[22] J. Gasteiger, F. Becker, and S. Günnemann, “Gemnet: Universal directional graph neural networks for molecules,” Advances in Neural Information Processing Systems, vol. 34, pp. 6790–6802, 2021.
[23] P. Zhou, J. Qi, A. Duan, S. Huo, Z. Wu, and D. Navarro-Alarcon, “Imitating tool-based garment folding from a single visual observation using hand-object graph dynamics,” IEEE Transactions on Industrial Informatics, 2024.
[24] T. Wang, R. Liao, J. Ba, and S. Fidler, “Nervenet: Learning structured policy with graph neural networks,” in International conference on learning representations, 2018.
[25] E. Tolstaya, F. Gama, J. Paulos, G. Pappas, V. Kumar, and A. Ribeiro, “Learning decentralized controllers for robot swarms with graph neural networks,” in Conference on robot learning. PMLR, 2020, pp. 671–682.
[26] Q. Li, F. Gama, A. Ribeiro, and A. Prorok, “Graph neural networks for decentralized multi-robot path planning,” in 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2020, pp. 11 785–11 792.
[27] E. Tolstaya, J. Paulos, V. Kumar, and A. Ribeiro, “Multi-robot coverage and exploration using spatial graph neural networks,” in 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2021, pp. 8944–8950.
[28] H. Shi, H. Xu, Z. Huang, Y. Li, and J. Wu, “Robocraft: Learning to see, simulate, and shape elasto-plastic objects with graph networks,” arXiv preprint arXiv:2205.02909, 2022.
[29] C. Wang, Y. Zhang, X. Zhang, Z. Wu, X. Zhu, S. Jin, T. Tang, and M. Tomizuka, “Offline-online learning of deformation model for cable manipulation with graph neural networks,” IEEE Robotics and Automation Letters, vol. 7, no. 2, pp. 5544–5551, 2022.
[30] Y. Rubanova, A. Sanchez-Gonzalez, T. Pfaff, and P. Battaglia, “Constraint-based graph network simulator,” arXiv preprint arXiv:2112.09161, 2021.
[31] H. Bertiche, M. Madadi, and S. Escalera, “Neural cloth simulation,” ACM Transactions on Graphics (TOG), vol. 41, no. 6, pp. 1–14, 2022.
[32] R. Fletcher, Practical methods of optimization. John Wiley & Sons, 2000.
[33] R. Lagneau, A. Krupa, and M. Marchal, “Active deformation through visual servoing of soft objects,” in 2020 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2020, pp. 8978–8984.
[34] J. Qi, G. Ma, P. Zhou, H. Zhang, Y. Lyu, and D. Navarro-Alarcon, “Towards latent space based manipulation of elastic rods using autoencoder models and robust centerline extractions,” Advanced Robotics, vol. 36, no. 3, pp. 101–115, 2022.
[35] F. Makiyeh, M. Marchal, F. Chaumette, and A. Krupa, “Indirect positioning of a 3d point on a soft object using rgb-d visual servoing and a mass-spring model,” in 2022 17th International Conference on Control, Automation, Robotics and Vision (ICARCV). IEEE, 2022, pp. 235–242.
[36] Z. Zhang, T. M. Bieze, J. Dequidt, A. Kruszewski, and C. Duriez, “Visual servoing control of soft robots based on finite element model,” in 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2017, pp. 2895–2901.