As-Plausible-As-Possible: Plausibility-Aware Mesh Deformation
Using 2D Diffusion Priors

Seungwoo Yoo*¹ Kunho Kim*¹ Vladimir G. Kim² Minhyuk Sung¹
¹KAIST ²Adobe Research

Abstract

We present the As-Plausible-As-Possible (APAP) mesh deformation technique, which leverages 2D diffusion priors to preserve the plausibility of a mesh under user-controlled deformation. Our framework uses per-face Jacobians to represent mesh deformations, where mesh vertex coordinates are computed via a differentiable Poisson solve. The deformed mesh is rendered, and the resulting 2D image is used in the Score Distillation Sampling (SDS) process, which enables extracting meaningful plausibility priors from a pretrained 2D diffusion model. To better preserve the identity of the edited mesh, we fine-tune our 2D diffusion model with LoRA. Gradients extracted by SDS and a user-prescribed handle displacement are then backpropagated to the per-face Jacobians, and we use iterative gradient descent to compute the final deformation that balances the user edit against the output plausibility. We evaluate our method with 2D and 3D meshes and demonstrate qualitative and quantitative improvements when using plausibility priors over the geometry-preservation or distortion-minimization priors used by previous techniques. Our project page is at: https://as-plausible-as-possible.github.io/

Figure 1: APAP, our novel shape deformation method, enables plausibility-aware mesh deformation and preservation of fine details of the original mesh offering an interface that alters geometry by directly displacing a handle (red) along a direction (gray), fixing an anchor vertex (green). Using a diffusion prior results in smoother geometry around the armchair handle, as seen in the example (middle column).

* Equal contribution.

1 Introduction

For 2D and 3D content, the mesh is the most prevalent representation, owing to its storage efficiency, simplicity of rendering, compatibility with common graphics pipelines, versatility across applications such as design, physical simulation, and 3D printing, and flexibility in decoupling geometry from appearance. These properties have led to its widespread adoption in industry.

For the creation of 2D and 3D meshes, recent breakthroughs in generative models [50, 53, 32, 43, 60, 51, 39, 57] have demonstrated significant advances. These breakthroughs enable users to easily generate content from a text prompt [43, 60, 51, 39, 57], or from photos [51, 45]. However, visual content creation typically involves numerous editing processes, deforming the content to satisfy users’ desires through interactions such as mouse clicks and drags. Facilitating such interactive editing has remained relatively underexplored in the context of recent generative techniques.

Mesh deformation is a subject that has been researched for decades in computer graphics. Over time, researchers have established well-defined methodologies, characterizing mesh deformation as an optimization problem that aims to preserve specific geometric properties, such as the Mesh Laplacian [36, 55, 37], local rigidity [18, 54], and mesh surface Jacobians [2, 13], while satisfying given constraints. To facilitate user interaction, these methodologies have been extended to introduce specific user-interactive deformation handles, such as keypoints [20, 59, 28], cage mesh [26, 25, 35, 61, 66, 23], and skeleton [4, 64, 65], with the blending functions defined based on the preservation of geometric properties.

Despite the widespread use of classical mesh deformation methods, they often fail to meet users’ needs because they do not incorporate the perceptual plausibility of the outputs. For example, as illustrated in Fig. 1, when a user intends to drag a point on the top of a table image, the classical deformation technique may introduce unnatural bending instead of lifting the tabletop. This limitation arises because deformation techniques solely based on geometric properties do not incorporate such semantic and perceptual priors, resulting in the mesh editing process becoming more tedious and time-consuming.

Recent learning-based mesh deformation techniques [66, 23, 28, 64, 38, 2, 56] have attempted to address this problem in a data-driven way. However, they are also limited by relying on the existence of certain variations in the training data. Even recent large-scale 3D datasets [6, 63, 8, 7] have not reached the scale that covers all possible visual content users might intend to create.

To this end, we introduce our novel mesh deformation framework, dubbed APAP (As-Plausible-As-Possible), which exploits 2D image priors from a diffusion model pretrained on an Internet-scale image dataset to enhance the plausibility of deformed 2D and 3D meshes while preserving the geometric priors of the given shape. Recently, score distillation sampling (SDS) [43] has demonstrated great success in generating plausible 2D and 3D content, such as NeRF [69, 29, 24] and vector images [22, 19], using the distilled 2D image priors from a diffusion model. We incorporate these diffusion-model-based 2D priors into the optimization-based deformation framework, achieving the best synergy between geometry-based optimization and distilled-prior-based optimization.

To achieve this optimal synergy between geometric and perceptual priors within a unified framework, we introduce an alternative optimization approach. At each step, we first update the Jacobian of each mesh face using the SDS loss and user-provided constraints. Subsequently, the mesh vertex positions are recalculated by solving Poisson’s equation with the updated face Jacobians. The direct application of the 2D diffusion prior via SDS, however, tends to compromise the identity of the given objects—an essential aspect in deformation. We thus enhance the identity awareness of the diffusion prior by finetuning it with the provided source image. The model is integrated into our two-stage pipeline that initiates deformation without the perceptual prior (SDS) and refines it with SDS and the given constraints afterward to create deformations that adhere to user-defined editing instructions while remaining visually plausible.

In experiments, we evaluate APAP on APAP-Bench, which consists of 3D and 2D triangular meshes and editing instructions. The proposed method produces more plausible deformations of 3D meshes than its baseline [54], which is based exclusively on a geometric prior. Evaluation on the task of 2D mesh editing further verifies the effectiveness of APAP, as illustrated by the highest $k$-NN GIQA score [14] in quantitative analysis and the higher preference over the baseline in a user study.

2 Related Work

2.1 Geometric Mesh Deformation

Mesh deformation has been one of the central problems in geometry processing and is thus addressed by a wide range of techniques. Cage-based methods [26, 25, 35, 61] let users alter meshes by manipulating cages enclosing them, calculating a point inside as a weighted sum of cage vertices. Skeleton-based approaches [62, 4, 64, 65] offer animation control by mapping surface points to underlying joints and bones, ideal for animating human/animal-like figures. Unlike the previous techniques that require manual cage or skeleton construction, biharmonic coordinates-based methods [20, 59] automate establishing mappings from control points to vertices by formulating optimization problems. Other types of works instead allow users to manipulate shapes via direct vertex displacement while imposing constraints on local surface geometry, including rigidity [18, 54] and Laplacian smoothness [36, 55, 37]. Such hand-crafted deformation priors often lack consideration of visual plausibility, necessitating careful control point placement and iterative manual refinement to achieve satisfactory results.

2.2 Data-Driven Mesh Deformation

Data-driven approaches to mesh deformation [66, 23, 28, 64, 38, 2, 56] learn from shape collections, utilizing neural networks to infer parameters for classical deformation techniques, such as cage vertex coordinates and displacements [66], keypoints [23, 59, 28], subspaces of keypoint arrangements [38], differential coordinates [2], etc. However, these methods assume the availability of large-scale category-specific shape collection [66, 23, 59, 28, 64] or require dense correspondences between them [56, 2], limiting their applicability to new, out-of-sample shapes. We instead propose to directly mine deformation priors from pretrained diffusion models. Leveraging a generic (category-agnostic) image generative model trained on an Internet-scale image dataset, we devise a method that easily generalizes to novel 2D and 3D shapes while lifting the requirement for shape collections.

2.3 Pretrained 2D Priors for Shape Manipulation

Image analysis [44] and generation [47, 67, 3, 34] techniques can serve as effective visual priors for image editing tasks [5, 16, 58, 68, 52]. In addition, recent work [12, 48] and its adaptation [11] enable personalized image generation and editing by learning a text embedding [12] or fine-tuning additional parameters, such as LoRA [17], to preserve and replicate the identities of given exemplars during editing. One interesting work is DragDiffusion [52], akin to DragGAN [41], which introduces a drag-based user interface for image editing through the manipulation of latent representations. However, it does not extend to the deformation of parametric images, such as 2D meshes, or of 3D shapes. Another interesting line of works [40, 27, 13] extends the idea further to manipulate shapes by propagating image-based gradients to the underlying shape representations. They maximize CLIP [44] similarity between the renderings and text prompts to either add geometric textures [40], jointly update both vertices and texture [27], or deform a shape parameterized by per-triangle Jacobians [13]. In contrast to such text-driven editing techniques, we build on Score Distillation Sampling (SDS) [43] to enable direct manipulation of shapes via handle displacement, ensuring visual plausibility. While the technique is prevalent in various problems ranging from text-to-3D [43, 60, 51, 39, 57] to image editing [15] and neural field editing [69], it has not been adopted for shape deformation.

3 Method

We present APAP, a novel handle-based mesh deformation framework capable of producing visually plausible deformations of either 2D or 3D triangular meshes. To achieve this goal, we integrate powerful 2D diffusion priors into a learnable Jacobian field representation of shapes.

We emphasize that leveraging 2D priors, such as latent diffusion models (LDMs) [47] trained on large-scale datasets [49], for shape deformation poses challenges that require meticulous design choices. The following sections will delve into the details of shape representation (Sec. 3.1) and diffusion prior (Sec. 3.2), offering a rationale for the design decisions underpinning our framework (Sec. 3.3).

3.1 Representing Shapes as Jacobian Fields

Let $\mathcal{M}_{0}=\left(\mathbf{V}_{0},\mathbf{F}_{0}\right)$ denote a source mesh to be deformed, represented by vertices $\mathbf{V}_{0}\in\mathbb{R}^{V\times 3}$ and faces $\mathbf{F}_{0}\in\mathbb{R}^{F\times 3}$. Users are allowed to select a set of vertices used as movable handles designated by an indicator matrix $\mathbf{K}_{h}\in\{0,1\}^{V_{h}\times V}$. We also require users to select a set of anchors, represented as another indicator matrix $\mathbf{K}_{a}\in\{0,1\}^{V_{a}\times V}$, to avoid trivial solutions (i.e., global translations). Then, the handle and anchor vertices become $\mathbf{V}_{h}=\mathbf{K}_{h}\mathbf{V}_{0}$ and $\mathbf{V}_{a}=\mathbf{K}_{a}\mathbf{V}_{0}$.

Our framework also expects a set of vectors $\mathbf{D}_{h}\in\mathbb{R}^{V_{h}\times 3}$ that indicate the directions along which the handles will be displaced. Furthermore, we let $\mathbf{T}_{h}=\mathbf{V}_{h}+\mathbf{D}_{h}$ and $\mathbf{T}_{a}=\mathbf{V}_{a}$ denote the target positions of the user-specified handles and anchors, respectively.
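As a minimal sketch of the notation above, the selection matrices and target positions can be assembled as follows. The toy mesh, the chosen vertex indices, the displacement, and the helper name `selection_matrix` are illustrative assumptions, not part of the benchmark.

```python
import numpy as np

def selection_matrix(indices, num_vertices):
    """Build a {0,1} matrix K of shape (len(indices), num_vertices) so that K @ V == V[indices]."""
    K = np.zeros((len(indices), num_vertices))
    K[np.arange(len(indices)), indices] = 1.0
    return K

# Toy source mesh: the four corners of a unit square (illustrative values).
V0 = np.array([[0.0, 0.0, 0.0],
               [1.0, 0.0, 0.0],
               [1.0, 1.0, 0.0],
               [0.0, 1.0, 0.0]])

K_h = selection_matrix([2], len(V0))   # one handle: vertex 2
K_a = selection_matrix([0], len(V0))   # one anchor: vertex 0
D_h = np.array([[0.5, 0.5, 0.0]])      # user-specified handle displacement

V_h, V_a = K_h @ V0, K_a @ V0          # current handle and anchor positions
T_h, T_a = V_h + D_h, V_a              # target positions of handles and anchors
```

Writing the selection as a matrix product keeps the handle and anchor terms linear in the vertex positions, which is what the later least-squares formulation relies on.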

In this work, we employ a Jacobian field $\mathbf{J}_{0}=\{\mathbf{J}_{0,f}|f\in\mathbf{F}_{0}\}$, a dual representation of $\mathcal{M}_{0}$, defined as a set of per-face Jacobians $\mathbf{J}_{0,f}\in\mathbb{R}^{3\times 3}$ where

$$\mathbf{J}_{0,f}=\boldsymbol{\nabla}_{f}\mathbf{V}_{0}, \tag{1}$$

and $\boldsymbol{\nabla}_{f}$ is the gradient operator of triangle $f$.
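One way to make Eqn. 1 concrete is to build the per-triangle gradient operator from the gradients of the piecewise-linear hat functions. The sketch below assumes this standard discretization and a particular index layout (rows are spatial directions, columns are vertex coordinates); the function names are hypothetical and the layout used by the actual solver [2] may differ.

```python
import numpy as np

def face_gradient_operator(p0, p1, p2):
    """Return G (3 x 3): column i holds the gradient of the hat function of corner i."""
    n = np.cross(p1 - p0, p2 - p0)
    double_area = np.linalg.norm(n)
    n = n / double_area
    # Edge opposite to each corner, oriented counter-clockwise.
    opposite_edges = np.stack([p2 - p1, p0 - p2, p1 - p0])
    grads = np.cross(n, opposite_edges) / double_area  # row i = gradient of hat function i
    return grads.T

def face_jacobians(V, F, V_ref):
    """Per-face Jacobians of the map that sends the reference mesh V_ref to positions V."""
    J = np.empty((len(F), 3, 3))
    for f, (i, j, k) in enumerate(F):
        G = face_gradient_operator(V_ref[i], V_ref[j], V_ref[k])
        J[f] = G @ V[[i, j, k]]
    return J

# Toy mesh: a unit square split into two triangles (illustrative values).
V0 = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [0.0, 1.0, 0.0]])
F0 = np.array([[0, 1, 2], [0, 2, 3]])

J0 = face_jacobians(V0, F0, V0)            # Jacobian field of the undeformed mesh
J_scaled = face_jacobians(2.0 * V0, F0, V0)  # uniform scaling doubles every Jacobian
```

For the undeformed mesh each Jacobian is the projection onto the triangle's tangent plane, so a planar mesh in the $z=0$ plane yields $\operatorname{diag}(1,1,0)$ on every face.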

Conversely, we compute a set of deformed vertices $\mathbf{V}^{*}$ from a given Jacobian field $\mathbf{J}$ by solving Poisson's equation

$$\mathbf{V}^{*}=\operatorname*{arg\,min}_{\mathbf{V}}\|\mathbf{L}\mathbf{V}-\boldsymbol{\nabla}^{T}\mathcal{A}\mathbf{J}\|^{2}, \tag{2}$$

where $\boldsymbol{\nabla}$ is a stack of per-face gradient operators, $\mathcal{A}\in\mathbb{R}^{3F\times 3F}$ is the mass matrix, and $\mathbf{L}\in\mathbb{R}^{V\times V}$ is the cotangent Laplacian of $\mathcal{M}_{0}$. Since $\mathbf{L}$ is rank-deficient, the solution of Eqn. 2 cannot be uniquely determined unless we impose constraints. We thus consider a constrained optimization problem

$$\mathbf{V}^{*}=\operatorname*{arg\,min}_{\mathbf{V}}\|\mathbf{L}\mathbf{V}-\boldsymbol{\nabla}^{T}\mathcal{A}\mathbf{J}\|^{2}+\lambda\|\mathbf{K}_{a}\mathbf{V}-\mathbf{T}_{a}\|^{2}, \tag{3}$$

where $\lambda\in\mathbb{R}^{+}$ is a weight for the constraint term. Note that we solve Eqn. 3 with the user-specified anchors as constraints to determine $\mathbf{V}^{*}$.

Taking the derivative with respect to $\mathbf{V}$, the problem in Eqn. 3 turns into a system of equations

$$\left(\mathbf{L}^{T}\mathbf{L}+\lambda\mathbf{K}_{a}^{T}\mathbf{K}_{a}\right)\mathbf{V}=\mathbf{L}^{T}\boldsymbol{\nabla}^{T}\mathcal{A}\mathbf{J}+\lambda\mathbf{K}_{a}^{T}\mathbf{T}_{a}, \tag{4}$$

which can be efficiently solved using a differentiable solver [2] implementing Cholesky decomposition.
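The linear system in Eqn. 4 can be illustrated on a tiny mesh with dense NumPy linear algebra. This is only a sketch under stated assumptions: the Laplacian is assembled as $\boldsymbol{\nabla}^{T}\mathcal{A}\boldsymbol{\nabla}$, the mass matrix holds each face area repeated three times, $\lambda=1$, and `np.linalg.cholesky` stands in for the sparse differentiable solver of [2]. The helper names are hypothetical.

```python
import numpy as np

def build_operators(V, F):
    """Stacked gradient operator (3F x V), mass matrix (3F x 3F), and Laplacian (V x V)."""
    num_v, num_f = len(V), len(F)
    grad = np.zeros((3 * num_f, num_v))
    areas = np.zeros(num_f)
    for f, (i, j, k) in enumerate(F):
        n = np.cross(V[j] - V[i], V[k] - V[i])
        double_area = np.linalg.norm(n)
        n = n / double_area
        edges = np.stack([V[k] - V[j], V[i] - V[k], V[j] - V[i]])
        grad[3 * f:3 * f + 3, [i, j, k]] = (np.cross(n, edges) / double_area).T
        areas[f] = 0.5 * double_area
    mass = np.diag(np.repeat(areas, 3))
    lap = grad.T @ mass @ grad  # cotangent Laplacian written as grad^T A grad
    return grad, mass, lap

def poisson_solve(J, K_a, T_a, grad, mass, lap, lam=1.0):
    """Solve Eqn. 4 for the vertices using a Cholesky factorization."""
    lhs = lap.T @ lap + lam * K_a.T @ K_a
    rhs = lap.T @ grad.T @ mass @ J + lam * K_a.T @ T_a
    chol = np.linalg.cholesky(lhs)        # lhs = chol @ chol.T
    y = np.linalg.solve(chol, rhs)
    return np.linalg.solve(chol.T, y)

# Toy mesh: a unit square split into two triangles, anchored at vertex 0.
V0 = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [0.0, 1.0, 0.0]])
F0 = np.array([[0, 1, 2], [0, 2, 3]])
K_a = np.zeros((1, 4))
K_a[0, 0] = 1.0
T_a = K_a @ V0

grad, mass, lap = build_operators(V0, F0)
J0 = grad @ V0                                   # stacked Jacobians of the source mesh
V_rec = poisson_solve(J0, K_a, T_a, grad, mass, lap)
V_scaled = poisson_solve(2.0 * J0, K_a, T_a, grad, mass, lap)
```

Feeding the source Jacobians back into the solver recovers the source vertices, and doubling the Jacobians doubles the shape about the anchored vertex. The anchor term is what makes the system matrix positive definite, so the Cholesky factorization exists.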

We let $g$ denote a functional representing the aforementioned differentiable solver for notational convenience, i.e., $\mathbf{V}^{*}=g\left(\mathbf{J},\mathbf{K}_{a},\mathbf{T}_{a}\right)$. Since $g$ is differentiable, we can deform $\mathcal{M}_{0}$ by propagating upstream gradients from various loss functions to the underlying parameterization $\mathbf{J}$. For instance, one may impose a soft constraint on the locations of selected handles during optimization with an objective of the form:

$$\mathcal{L}_{h}=\|\mathbf{K}_{h}\mathbf{V}^{*}-\mathbf{T}_{h}\|^{2}. \tag{5}$$

We will discuss how such a soft constraint can be blended into our framework in Sec. 3.3. Next, we describe how to incorporate a pretrained diffusion model as a prior for visual plausibility.

Figure 2: The overview of APAP. APAP parameterizes a triangular mesh as a per-face Jacobian field that can be updated via gradient descent. Given a textured mesh and user inputs specifying the handle(s) and anchor(s), our framework initializes a Jacobian field as a trainable parameter. During the first stage, the Jacobian field is updated via iterative optimization of $\mathcal{L}_{h}$, a soft constraint that initially deforms the shape according to the user’s instruction. In the following stage, the mesh is rendered using a differentiable renderer $\mathcal{R}$ and the rendered image is provided as an input to a diffusion prior finetuned with LoRA [17] that computes the SDS loss $\mathcal{L}_{\text{SDS}}$. The joint optimization of $\mathcal{L}_{h}$ and $\mathcal{L}_{\text{SDS}}$ improves the visual plausibility of the mesh while conforming to the given edit instruction.

3.2 Score Distillation for Shape Deformation

While traditional mesh deformation techniques make variations that match the given geometric constraints, their lack of consideration on visual plausibility results in unrealistic shapes. Motivated by recent success in text-to-3D literature, we harness a powerful 2D diffusion prior [47] in our framework as a critic that directs deformation by scoring the realism of the current shape.

Specifically, we distill its prior knowledge via Score Distillation Sampling (SDS) [43]. Let $\mathbf{J}$ denote the current Jacobian field and $\mathbf{V}^{*}$ be the set of vertices computed from $\mathbf{J}$ following the procedure described in Sec. 3.1.

We render $\mathcal{M}^{*}=\left(\mathbf{V}^{*},\mathbf{F}_{0}\right)$ from a viewpoint defined by camera extrinsic parameters $\mathbf{C}$ using a differentiable renderer $\mathcal{R}$, producing an image $\mathcal{I}=\mathcal{R}\left(\mathcal{M}^{*},\mathbf{C}\right)$. The diffusion prior $\hat{\epsilon}_{\phi}$ then rates the realism of $\mathcal{I}$, producing a gradient

$$\nabla_{\mathbf{J}}\mathcal{L}_{\text{SDS}}\left(\phi,\mathcal{I}\right)=\mathbb{E}_{t,\epsilon}\left[w\left(t\right)\left(\hat{\epsilon}_{\phi}\left(\mathbf{z}_{t};y,t\right)-\epsilon\right)\frac{\partial\mathcal{I}}{\partial\mathbf{J}}\right], \tag{6}$$

where $t\sim\mathcal{U}\left(0,1\right)$, $\epsilon\sim\mathcal{N}\left(\mathbf{0},\mathbf{I}\right)$, and $\mathbf{z}_{t}$ is a noisy latent embedding of $\mathcal{I}$. The propagated gradient alters the geometry of $\mathcal{M}^{*}$ by modifying $\mathbf{J}$.
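The structure of the SDS gradient in Eqn. 6 can be illustrated with a toy stand-in for the diffusion model. The sketch below makes several loudly simplifying assumptions: the "image" is a plain vector optimized directly (so the factor $\partial\mathcal{I}/\partial\mathbf{J}$ is the identity), $w(t)=1$, the noise schedule `alpha_bar` is hypothetical, and the noise predictor is the analytic optimum for data drawn from a Gaussian $\mathcal{N}(\mu,\sigma^{2}\mathbf{I})$ rather than a pretrained network.

```python
import numpy as np

def alpha_bar(t):
    """Hypothetical noise schedule: fraction of signal variance kept at time t in (0, 1)."""
    return np.cos(0.5 * np.pi * t) ** 2

def gaussian_noise_predictor(z_t, t, mu, sigma):
    """Analytic noise prediction when clean data follow N(mu, sigma^2 I) (stand-in prior)."""
    a = alpha_bar(t)
    var_t = a * sigma ** 2 + (1.0 - a)
    return np.sqrt(1.0 - a) * (z_t - np.sqrt(a) * mu) / var_t

def sds_gradient(x, mu, sigma, rng, num_samples=16):
    """Monte Carlo estimate of E_{t,eps}[w(t) (eps_hat - eps)] with w(t) = 1."""
    grad = np.zeros_like(x)
    for _ in range(num_samples):
        t = rng.uniform(0.02, 0.98)
        eps = rng.standard_normal(x.shape)
        a = alpha_bar(t)
        z_t = np.sqrt(a) * x + np.sqrt(1.0 - a) * eps   # forward diffusion of x
        grad += gaussian_noise_predictor(z_t, t, mu, sigma) - eps
    return grad / num_samples

rng = np.random.default_rng(0)
mu = np.zeros(8)                 # mode of the toy prior
x = mu + 3.0                     # start far from what the prior considers plausible
init_dist = np.linalg.norm(x - mu)
for _ in range(200):
    x = x - 0.1 * sds_gradient(x, mu, 0.5, rng)
final_dist = np.linalg.norm(x - mu)
```

In expectation the residual $\hat{\epsilon}-\epsilon$ points away from the high-density region of the prior, so descending along it pulls the sample toward that region. In the full method the same residual is instead backpropagated through the renderer and the Poisson solve to the Jacobians.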

To increase the instance-awareness of the diffusion model, we follow recent work [48, 52] on personalized image editing and finetune the model using LoRA [17]. In particular, we first render $\mathcal{M}_{0}$ from $n$ different viewpoints to obtain a set $\{\mathcal{I}_{1},\dots,\mathcal{I}_{n}\}$ of training images and inject additional parameters into the model, resulting in an expanded set of network parameters $\phi^{\prime}$. The parameters are then optimized with a denoising loss [47]

$$\mathcal{L}=\mathbb{E}_{t,\epsilon,\mathbf{z}}\left[\|\hat{\epsilon}_{\phi^{\prime}}\left(\mathbf{z}_{t};y,t\right)-\epsilon\|^{2}\right], \tag{7}$$

where $\mathbf{z}_{t}$ denotes a latent of a training image perturbed with noise at timestep $t$.
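The "additional parameters" injected by LoRA [17] can be sketched for a single linear layer. The code assumes the standard LoRA formulation, a frozen weight plus a scaled low-rank product with the up-projection initialized to zero; the rank, scaling, and layer size are illustrative, since the text does not state which values or layers are used.

```python
import numpy as np

class LoRALinear:
    """Frozen linear layer W plus a trainable low-rank update (alpha / rank) * B @ A."""

    def __init__(self, weight, rank, alpha, rng):
        out_dim, in_dim = weight.shape
        self.weight = weight                                   # frozen pretrained weight
        self.A = rng.standard_normal((rank, in_dim)) * 0.01    # trainable down-projection
        self.B = np.zeros((out_dim, rank))                     # trainable up-projection, zero init
        self.scale = alpha / rank

    def __call__(self, x):
        return x @ (self.weight + self.scale * self.B @ self.A).T

    def num_trainable(self):
        return self.A.size + self.B.size

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64))
layer = LoRALinear(W, rank=4, alpha=4.0, rng=rng)
x = rng.standard_normal((2, 64))
base_out = x @ W.T
lora_out = layer(x)
```

Because the up-projection starts at zero, the adapted layer initially reproduces the pretrained layer exactly, and only the small matrices `A` and `B` would be updated when minimizing the denoising loss in Eqn. 7.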

The finetuned diffusion prior, together with a learnable Jacobian field representation of the source mesh $\mathcal{M}_{0}$, comprises the proposed framework described in the following section.

3.3 As-Plausible-As-Possible (APAP)

Algorithm 1 As-Plausible-As-Possible

Parameters: $g$, $\mathcal{R}$, $\phi$, $\gamma$, $M$, $N$

Inputs: $\mathcal{M}_{0}=\left(\mathbf{V}_{0},\mathbf{F}_{0}\right)$, $\mathbf{K}_{a}$, $\mathbf{K}_{h}$, $\mathbf{T}_{a}$, $\mathbf{T}_{h}$, $\left\{\mathbf{C}_{i}\right\}_{i=1}^{n}$

Output: $\mathcal{M}$

procedure FirstStage($\mathbf{J}$, $\mathbf{K}_{a}$, $\mathbf{K}_{h}$, $\mathbf{T}_{a}$, $\mathbf{T}_{h}$, $g$)
  for $i=1,2,\dots,M$ do
    $\mathbf{V}^{*}\leftarrow g\left(\mathbf{J},\mathbf{K}_{a},\mathbf{T}_{a}\right)$  $\triangleright$ Solving Eqn. 4
    $\mathbf{J}\leftarrow\mathbf{J}-\gamma\nabla_{\mathbf{J}}\mathcal{L}_{h}\left(\mathbf{V}^{*},\mathbf{K}_{h},\mathbf{T}_{h}\right)$
  end for
  return $\mathbf{J}$
end procedure

procedure SecondStage($\mathbf{J}$, $\mathbf{F}_{0}$, $\mathbf{K}_{a}$, $\mathbf{K}_{h}$, $\mathbf{T}_{a}$, $\mathbf{T}_{h}$, $g$, $\phi$, $\left\{\mathbf{C}_{i}\right\}$)
  for $i=1,2,\dots,N$ do
    $\mathbf{V}^{*}\leftarrow g\left(\mathbf{J},\mathbf{K}_{a},\mathbf{T}_{a}\right)$  $\triangleright$ Solving Eqn. 4
    $\mathcal{M}^{*}\leftarrow\left(\mathbf{V}^{*},\mathbf{F}_{0}\right)$
    $\mathbf{C}\sim\mathcal{U}\left(\left\{\mathbf{C}_{i}\right\}\right)$  $\triangleright$ Viewpoint Sampling
    $\mathcal{I}\leftarrow\mathcal{R}\left(\mathcal{M}^{*},\mathbf{C}\right)$  $\triangleright$ Rendering
    $\mathbf{J}\leftarrow\mathbf{J}-\gamma\nabla_{\mathbf{J}}\left(\mathcal{L}_{\text{SDS}}\left(\phi,\mathcal{I}\right)+\mathcal{L}_{h}\left(\mathbf{V}^{*},\mathbf{K}_{h},\mathbf{T}_{h}\right)\right)$
  end for
  return $\mathbf{J}$
end procedure

$\phi\leftarrow$ LoRA($\phi$, $\mathcal{M}_{0}$, $\mathcal{R}$, $\left\{\mathbf{C}_{i}\right\}$)
$\mathbf{J}\leftarrow\{\mathbf{J}_{0,f}|f\in\mathbf{F}_{0}\}$
$\mathbf{J}\leftarrow$ FirstStage($\mathbf{J}$, $\mathbf{K}_{a}$, $\mathbf{K}_{h}$, $\mathbf{T}_{a}$, $\mathbf{T}_{h}$, $g$)
$\mathbf{J}\leftarrow$ SecondStage($\mathbf{J}$, $\mathbf{F}_{0}$, $\mathbf{K}_{a}$, $\mathbf{K}_{h}$, $\mathbf{T}_{a}$, $\mathbf{T}_{h}$, $g$, $\phi$, $\left\{\mathbf{C}_{i}\right\}$)
$\mathbf{V}\leftarrow g\left(\mathbf{J},\mathbf{K}_{a},\mathbf{T}_{a}\right)$
$\mathcal{M}\leftarrow\left(\mathbf{V},\mathbf{F}_{0}\right)$
return $\mathcal{M}$

APAP tackles the problem of plausibility-aware shape deformation by harmonizing the best of both worlds: a learnable shape representation founded on classical geometry processing, robust to noisy gradients, and a powerful 2D diffusion prior finetuned with the image(s) of the source mesh for better instance-awareness.

We provide an overview of the proposed pipeline in Fig. 2 and the algorithm in Alg. 1, and describe the details in the following. Provided with a textured mesh $\mathcal{M}_{0}$, handles $\mathbf{K}_{h}$, anchors $\mathbf{K}_{a}$, as well as their target positions $\mathbf{T}_{h}$ and $\mathbf{T}_{a}$ as inputs, APAP yields a plausible deformation $\mathcal{M}$ of $\mathcal{M}_{0}$ that conforms to the given handle-target constraints. Before deforming $\mathcal{M}_{0}$, we render $\mathcal{M}_{0}$ from a single view in the case of 2D meshes and four canonical views (i.e., front, back, left, and right) for 3D meshes and use the images to finetune Stable Diffusion [47] by optimizing LoRA [17] parameters injected into the model (the red line in Fig. 2). Simultaneously, APAP computes the Jacobian field $\mathbf{J}_{0}$ of the input mesh $\mathcal{M}_{0}$ and initializes it as a trainable parameter $\mathbf{J}$.

APAP deforms the input mesh through two stages. In the FirstStage, it first deforms the input mesh according to instructions from users without taking visual plausibility into account. The subsequent SecondStage integrates a 2D diffusion prior into the optimization loop, simultaneously enforcing user constraints and visual plausibility.

At every iteration of the FirstStage, illustrated as the blue box in Fig. 2, we compute the vertex positions $\mathbf{V}^{*}$ corresponding to the current Jacobian field $\mathbf{J}$ by solving Eqn. 3 using the anchors specified by $\mathbf{K}_{a}$ as hard constraints. Then, we compute the soft constraint $\mathcal{L}_{h}$ defined in Eqn. 5, which drags a set of handle vertices $\mathbf{K}_{h}\mathbf{V}^{*}$ toward the corresponding targets $\mathbf{T}_{h}$. The interleaving of the differentiable Poisson solve and the optimization of $\mathcal{L}_{h}$ via gradient descent is repeated for $M$ iterations. This progressively updates $\mathbf{J}$, treated as a learnable black box in our framework, deforming $\mathcal{M}_{0}$. Consequently, the edited mesh $\mathcal{M}^{*}=\left(\mathbf{V}^{*},\mathbf{F}_{0}\right)$ follows user constraints at the cost of degraded plausibility, which is mitigated in the following stage through the incorporation of a diffusion prior.
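The FirstStage loop can be sketched on a toy mesh with a hand-derived gradient. Because the Poisson solve is linear in $\mathbf{J}$, the code precomputes $\mathbf{V}^{*}=\mathbf{S}\mathbf{J}+\mathbf{c}$ and applies the chain rule explicitly instead of relying on automatic differentiation. The dense solver, $\lambda=1$, the step size, the iteration count, and all helper names are assumptions made for illustration.

```python
import numpy as np

def build_operators(V, F):
    """Stacked gradient operator (3F x V), mass matrix (3F x 3F), and Laplacian (V x V)."""
    grad = np.zeros((3 * len(F), len(V)))
    areas = np.zeros(len(F))
    for f, (i, j, k) in enumerate(F):
        n = np.cross(V[j] - V[i], V[k] - V[i])
        double_area = np.linalg.norm(n)
        n = n / double_area
        edges = np.stack([V[k] - V[j], V[i] - V[k], V[j] - V[i]])
        grad[3 * f:3 * f + 3, [i, j, k]] = (np.cross(n, edges) / double_area).T
        areas[f] = 0.5 * double_area
    mass = np.diag(np.repeat(areas, 3))
    return grad, mass, grad.T @ mass @ grad

def make_solver(K_a, T_a, grad, mass, lap, lam=1.0):
    """Precompute S and c such that V* = S @ J + c solves Eqn. 4."""
    lhs = lap.T @ lap + lam * K_a.T @ K_a
    S = np.linalg.solve(lhs, lap.T @ grad.T @ mass)
    c = np.linalg.solve(lhs, lam * K_a.T @ T_a)
    return S, c

def first_stage(J, K_h, T_h, S, c, step_size=0.1, num_iters=100):
    """Gradient descent on L_h = ||K_h V* - T_h||^2 with respect to the Jacobians J."""
    for _ in range(num_iters):
        V_star = S @ J + c                        # Poisson solve for the current Jacobians
        residual = K_h @ V_star - T_h
        grad_J = 2.0 * S.T @ K_h.T @ residual     # chain rule through the linear solve
        J = J - step_size * grad_J
    return J

# Toy mesh: unit square, anchor at vertex 0, handle at vertex 2 dragged diagonally.
V0 = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [0.0, 1.0, 0.0]])
F0 = np.array([[0, 1, 2], [0, 2, 3]])
K_a = np.zeros((1, 4))
K_a[0, 0] = 1.0
K_h = np.zeros((1, 4))
K_h[0, 2] = 1.0
T_a = K_a @ V0
T_h = K_h @ V0 + np.array([[0.5, 0.5, 0.0]])

grad, mass, lap = build_operators(V0, F0)
S, c = make_solver(K_a, T_a, grad, mass, lap)
J = first_stage(grad @ V0, K_h, T_h, S, c)
V_final = S @ J + c
```

The optimization variable is the Jacobian field rather than the vertices, so each step changes the whole shape through the solve while the anchor stays in place.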

The result of FirstStage then serves as an initialization for the SecondStage, illustrated as the green box in Fig. 2, which is guided by the plausibility constraint $\mathcal{L}_{\text{SDS}}$. Unlike the FirstStage, where the update of $\mathbf{J}$ was purely driven by the geometric constraint $\mathcal{L}_{h}$, we aim to steer the optimization based on the visual plausibility of the current mesh $\mathcal{M}^{*}$. To achieve this, we render $\mathcal{M}^{*}$ with a differentiable renderer $\mathcal{R}$ from the same viewpoint(s) from which the training image(s) for finetuning were rendered. When deforming 3D meshes, we randomly sample one viewpoint at each iteration. The rendered image $\mathcal{I}$ is used to evaluate $\mathcal{L}_{\text{SDS}}$, which is optimized jointly with $\mathcal{L}_{h}$ for $N$ iterations. The combination of geometric and plausibility constraints improves the visual plausibility of the output while encouraging it to conform to the given constraints.
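The per-iteration viewpoint sampling for 3D meshes can be sketched with the standard library alone. The camera radius, the axis convention (front view on the positive $z$ axis), and the use of camera centers in place of full extrinsic matrices are illustrative assumptions.

```python
import math
import random

def canonical_cameras(radius=2.0):
    """Camera centers for the front, right, back, and left views on a circle around the origin."""
    names = ["front", "right", "back", "left"]
    cameras = {}
    for idx, name in enumerate(names):
        azimuth = math.radians(90.0 * idx)
        cameras[name] = (radius * math.sin(azimuth), 0.0, radius * math.cos(azimuth))
    return cameras

def sample_viewpoint(cameras, rng):
    """Draw one of the finetuning viewpoints uniformly at random."""
    name = rng.choice(sorted(cameras))
    return name, cameras[name]

cams = canonical_cameras(2.0)
rng = random.Random(0)
sampled = [sample_viewpoint(cams, rng)[0] for _ in range(200)]
```

Restricting the samples to the views used for finetuning keeps the renderings within the image distribution that the personalized prior has seen.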

We note that the iterative approach in the FirstStage leads to better results than alternative update strategies, such as deforming the source mesh $\mathcal{M}_{0}$ by minimizing ARAP energy [54] or solving Eqn. 3 using both $\mathbf{K}_{h}$ and $\mathbf{K}_{a}$ as hard constraints. In our experiments (Sec. 4), we show that both methods produce distortions that cannot be corrected by the diffusion prior in the subsequent stage. Specifically, directly solving Eqn. 3 using all available constraints only yields the least-squares solution $\mathbf{V}^{*}$ without updating the underlying Jacobians $\mathbf{J}$, resulting in the aforementioned distortions.

4 Experiments

We evaluate APAP in downstream applications involving manipulation of 3D and 2D meshes.

4.1 Experiment Setup

Benchmark.

To evaluate the plausibility of a mesh deformation, we propose a novel benchmark, APAP-Bench, consisting of textured 3D and 2D triangular meshes that span both human-made and organic objects and are annotated with handle vertices, their editing directions, and anchor vertices. The set of 3D meshes, APAP-Bench 3D, is constructed using meshes from ShapeNet [6] and Genie [1]. The meshes are normalized to fit in a unit cube. Each mesh is manually annotated with editing instructions, including a set of anchors, handles, and corresponding targets to simulate editing scenarios. APAP-Bench offers another subset called APAP-Bench 2D, a collection of 80 textured, planar meshes of various objects, to facilitate the quantitative analysis and user study described later in this section. To create APAP-Bench 2D, we first generate 2 images of real-world objects for each of the 20 categories using Stable Diffusion-XL [42]. We then extract foreground masks from the generated images using SAM [31] and sample pixels that lie on the boundary and interior. The sampled pixels are used for Delaunay triangulation, constrained with the edges along the main contour of the masks, which produces 2D triangular meshes with texture. We assign two handle and anchor pairs to each mesh that imitate user instructions. For evaluation purposes, we populate the reference set by sampling 1,000 images for each object category using Stable Diffusion-XL. The generated images are used to evaluate a perceptual metric to assess the plausibility of 2D mesh editing results as described in Sec. 4.3.
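The unit-cube normalization of the 3D meshes mentioned above can be sketched as follows. Centering the bounding box at the origin and scaling by its longest side is an assumed convention; the text only states that the meshes fit in a unit cube.

```python
import numpy as np

def normalize_to_unit_cube(V):
    """Center the bounding box at the origin and scale its longest side to length 1."""
    bb_min, bb_max = V.min(axis=0), V.max(axis=0)
    center = 0.5 * (bb_min + bb_max)
    scale = (bb_max - bb_min).max()
    return (V - center) / scale

rng = np.random.default_rng(0)
V_raw = rng.uniform(-3.0, 7.0, size=(50, 3)) * np.array([1.0, 2.0, 0.5])  # arbitrary vertices
V_norm = normalize_to_unit_cube(V_raw)
extent = V_norm.max(axis=0) - V_norm.min(axis=0)
```

A uniform scale preserves the aspect ratio of the shape, so handle displacements specified in normalized coordinates remain comparable across meshes.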

Baselines.

We compare our method (APAP) against As-Rigid-As-Possible (ARAP) [54], since it is one of the most widely used mesh deformation techniques that permit shape manipulation via direct vertex displacement. Throughout the experiments, we use the implementation in libigl [21] with default parameters.

Evaluation Metrics.

In 2D experiments, we conduct quantitative analysis based on the $k$-NN GIQA score [14] as an evaluation metric to assess the plausibility of instance-specific editing results. The metric quantifies the perceptual proximity between the edited image and its $k$ nearest neighbors in the reference set included in APAP-Bench 2D. As our objective is to make plausible variations of 2D meshes via deformation, an edited object should remain perceptually similar to other objects in the same category. We use $k=12$ throughout the experiments.
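A rough sketch of a $k$-NN proximity score of this kind is given below. It assumes the score is the mean inverse squared feature distance to the $k$ nearest reference samples, and it uses random vectors as placeholders for image features; the feature extractor and the exact normalization used in [14] are not specified in this text, so both are assumptions.

```python
import numpy as np

def knn_giqa(query_feat, reference_feats, k=12):
    """Mean inverse squared distance from a query feature to its k nearest reference features."""
    dists_sq = np.sum((reference_feats - query_feat) ** 2, axis=1)
    nearest = np.sort(dists_sq)[:k]
    return float(np.mean(1.0 / (nearest + 1e-12)))   # small constant avoids division by zero

rng = np.random.default_rng(0)
reference = rng.standard_normal((100, 16))    # placeholder features of the reference set
query_near = np.zeros(16)                     # lies inside the reference distribution
query_far = np.full(16, 10.0)                 # lies far from every reference sample

score_near = knn_giqa(query_near, reference)
score_far = knn_giqa(query_far, reference)
```

Under this definition a higher score means the edited image sits closer to the reference images of its category, which matches the upward arrow used when reporting the metric.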

4.2 3D Shape Deformation

Qualitative Results.

We showcase examples of 3D shape deformation where each deformation is specified by a handle (red), an edit direction (gray), and an anchor (green). As shown in Fig. 3, APAP is capable of manipulating 3D shapes to improve visual plausibility, which is not achievable by relying solely on a geometric prior such as ARAP [54]. For instance, given a user input that drags a handle on one blade of an axe (the first row) along an arrow, APAP simultaneously expands both blades of the axe whereas ARAP [54] produces distortions near the head. Similar examples that demonstrate the symmetry-awareness of APAP can be found in other cases such as a car (the second row) and an owl (the sixth row), where a user lifts only one side of the shape upward and the symmetry is recovered by APAP, which cannot be achieved by ARAP [54]. Also, note that APAP is capable of making a smooth articulation at the leg of the wolf (the fourth row) by adjusting the overall posture, in comparison to ARAP, which creates excessive bending.

4.3 2D Mesh Editing

Qualitative Evaluation.

We present qualitative results using the baselines and our method in Fig. 5. Each row shows two different results obtained by editing an image based on a handle moved from the original position (red) along a direction indicated by an arrow (gray) while fixing an anchor (green), similar to the 3D experiments discussed in the previous section.

As shown in Fig. 5, ARAP [54] enforces local rigidity and often results in implausible deformations. For example, it does not account for the mechanics of the human body and introduces an unrealistic articulation of a human arm (the fourth row). In addition, it twists the body of a sports car (the fifth row). Both of them originate from the lack of understanding of the appearance of objects. APAP alleviates this issue by incorporating a visual prior into shape deformation producing a bending near the elbow and preserving the smooth silhouette of the car, respectively.

While APAP is designed for meshes, not images, we provide an additional qualitative comparison against DragDiffusion [52], an image editing technique that operates in pixel space, to demonstrate the effectiveness of mesh-based parameterization in applications where identity preservation is crucial. As shown in Fig. 4, DragDiffusion [52] may corrupt the identity of the instances depicted in input images during the encoding and decoding procedure. APAP, on the other hand, makes plausible variations of the given objects while maintaining their identity, benefiting from the explicit mesh representation on which it is grounded.

Methods	$k$-NN GIQA ($\times 10^{-2}$) $\uparrow$
ARAP [54]	4.753
DragDiffusion [52]	4.545
Ours ($\mathcal{L}_{h}$ Only)	4.797
Ours (ARAP Init.)	4.740
Ours (Poisson Init.)	4.316
Ours	4.887

Table 1: Quantitative analysis for 2D mesh editing. APAP outperforms its baselines in quantitative evaluation using $k$-NN GIQA [14].

Methods	Preference (%) $\uparrow$
ARAP [54]	40.83
Ours	59.17

Table 2: User study preference for 2D image editing. In a user study targeting users on Amazon Mechanical Turk (MTurk), the results produced by our method were preferred over the outputs of the baseline.

Quantitative Evaluation.

Tab. 1 summarizes $k$-NN GIQA scores measured on the outputs from ARAP [54] (the first row) and APAP (the sixth row) using APAP-Bench 2D. As shown, APAP demonstrates superior performance over ARAP [54]. This again verifies the observations from qualitative evaluation where ARAP [54] introduces distortions that harm visual plausibility. As in qualitative evaluation, we also report the $k$-NN GIQA score of DragDiffusion [52], degraded due to artifacts caused during direct manipulation of latents.

User Study.

We further conduct a user study for a more precise perceptual analysis. We follow Ritchie [46] and recruit participants on Amazon Mechanical Turk (MTurk). Each participant is provided with a set of 20 randomly sampled images of the source meshes paired with editing results of ARAP [54] and APAP. To check whether the response from a participant is reliable, we present 5 vigilance tests and collect 102 responses from the participants who passed the vigilance tests.

We instructed participants to select the most anticipated outcome when the displayed source image is edited by the dragging operation visualized as an arrow. We provide details of the user study in the supplementary material. Tab. 2 shows a higher preference of the participants on our method over ARAP [54] implying that our method produces more visually plausible deformations.

Ablation Study.

Tab. 1 summarizes the impact of different initialization strategies in the first stage on the $k$-NN GIQA score. As reported in the third row of the table, optimizing only $\mathcal{L}_{h}$, which aims exclusively to satisfy geometric constraints, leads to unnatural distortions. We provide a qualitative comparison in the supplementary material.

While designing the algorithm illustrated in Alg. 1, we considered other options for FirstStage. Instead of optimizing $\mathcal{L}_{h}$ to initially deform a shape, we used a shape produced by ARAP [54] or by solving a Poisson’s equation constrained not only on anchor positions but also on handles at their target positions, reached by following the given edit directions. We report $k$-NN GIQA scores of the alternatives in the fourth and fifth rows of Tab. 1, respectively. Both initialization strategies degrade the plausibility of results due to large distortions introduced by either solely enforcing local rigidity or finding least-squares solutions without updating Jacobians. Such distortions pose a challenge to the diffusion prior, which struggles to induce meaningful update directions when given renderings with noticeable distortions, as can be seen in the qualitative analysis in the supplementary material.

5 Conclusion

We presented APAP, a novel deformation framework that tackles the problem of plausibility-aware shape deformation while offering intuitive controls over a wide range of shapes represented as triangular meshes. To this end, we carefully orchestrate two core components, a learnable Jacobian-based parameterization that originates from geometry processing and powerful 2D priors acquired by text-to-image diffusion models trained on Internet-scale datasets. We assessed the performance of the proposed method against an existing geometric-prior-based deformation technique and also thoroughly investigated the significance of our design choices through experiments.

Acknowledgements

We appreciate RECON Labs for their support in our experiments. This work was supported by NRF grant (RS-2023-00209723) and IITP grants (2022-0-00594, RS-2023-00227592) funded by the Korean government (MSIT), Seoul R&BD Program (CY230112), and grants from the DRB-KAIST SketchTheFuture Research Center, Hyundai NGV, KT, NCSOFT, and Samsung Electronics.

References

[1] Luma AI. Genie.
[2] Noam Aigerman, Kunal Gupta, Vladimir G Kim, Siddhartha Chaudhuri, Jun Saito, and Thibault Groueix. Neural Jacobian Fields: Learning Intrinsic Mappings of Arbitrary Meshes. ACM TOG, 2022.
[3] Omer Bar-Tal, Lior Yariv, Yaron Lipman, and Tali Dekel. MultiDiffusion: Fusing Diffusion Paths for Controlled Image Generation. In ICML, 2023.
[4] Ilya Baran and Jovan Popović. Automatic Rigging and Animation of 3D Characters. ACM TOG, 2007.
[5] Mingdeng Cao, Xintao Wang, Zhongang Qi, Ying Shan, Xiaohu Qie, and Yinqiang Zheng. MasaCtrl: Tuning-Free Mutual Self-Attention Control for Consistent Image Synthesis and Editing. In ICCV, 2023.
[6] Angel X Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, et al. ShapeNet: An Information-Rich 3D Model Repository. arXiv preprint arXiv:1512.03012, 2015.
[7] Matt Deitke, Ruoshi Liu, Matthew Wallingford, Huong Ngo, Oscar Michel, Aditya Kusupati, Alan Fan, Christian Laforte, Vikram Voleti, Samir Yitzhak Gadre, Eli VanderBilt, Aniruddha Kembhavi, Carl Vondrick, Georgia Gkioxari, Kiana Ehsani, Ludwig Schmidt, and Ali Farhadi. Objaverse-XL: A Universe of 10M+ 3D Objects. In NeurIPS, 2023.
[8] Matt Deitke, Dustin Schwenk, Jordi Salvador, Luca Weihs, Oscar Michel, Eli VanderBilt, Ludwig Schmidt, Kiana Ehsani, Aniruddha Kembhavi, and Ali Farhadi. Objaverse: A Universe of Annotated 3D Objects. In CVPR, 2023.
[9] Yizhou Yu, Kun Zhou, Dong Xu, Xiaohan Shi, Hujun Bao, Baining Guo, and Heung-Yeung Shum. Mesh Editing with Poisson-Based Gradient Field Manipulation. ACM TOG, 2004.
[10] Hugging Face. Diffusers: State-of-the-art diffusion models for image and audio generation in PyTorch.
[11] Hugging Face. DreamBooth fine-tuning with LoRA.
[12] Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H. Bermano, Gal Chechik, and Daniel Cohen-Or. An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion. In ICLR, 2023.
[13] William Gao, Noam Aigerman, Thibault Groueix, Vladimir Kim, and Rana Hanocka. TextDeformer: Geometry Manipulation using Text Guidance. ACM TOG, 2023.
[14] Shuyang Gu, Jianmin Bao, Dong Chen, and Fang Wen. GIQA: Generated Image Quality Assessment. In ECCV, 2020.
[15] Amir Hertz, Kfir Aberman, and Daniel Cohen-Or. Delta Denoising Score. In ICCV, 2023.
[16] Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to-Prompt Image Editing with Cross-Attention Control. In ICLR, 2023.
[17] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, and Weizhu Chen. LoRA: Low-Rank Adaptation of Large Language Models. In ICLR, 2022.
[18] Takeo Igarashi, Tomer Moscovich, and John F. Hughes. As-Rigid-as-Possible Shape Manipulation. ACM TOG, 2005.
[19] Shir Iluz, Yael Vinker, Amir Hertz, Daniel Berio, Daniel Cohen-Or, and Ariel Shamir. Word-As-Image for Semantic Typography. ACM TOG, 2023.
[20] Alec Jacobson, Ilya Baran, Jovan Popović, and Olga Sorkine. Bounded Biharmonic Weights for Real-Time Deformation. ACM TOG, 2011.
[21] Alec Jacobson, Daniele Panozzo, et al. libigl: A simple C++ geometry processing library, 2018.
[22] Ajay Jain, Amber Xie, and Pieter Abbeel. VectorFusion: Text-to-SVG by Abstracting Pixel-Based Diffusion Models. In CVPR, 2023.
[23] Tomas Jakab, Richard Tucker, Ameesh Makadia, Jiajun Wu, Noah Snavely, and Angjoo Kanazawa. KeypointDeformer: Unsupervised 3D Keypoint Discovery for Shape Control. In CVPR, 2020.
[24] Jangho Park, Gihyun Kwon, and Jong Chul Ye. ED-NeRF: Efficient Text-Guided Editing of 3D Scene using Latent Space NeRF. arXiv, 2023.
[25] Pushkar Joshi, Mark Meyer, Tony DeRose, Brian Green, and Tom Sanocki. Harmonic Coordinates for Character Articulation. ACM TOG, 2007.
[26] Tao Ju, Scott Schaefer, and Joe Warren. Mean Value Coordinates for Closed Triangular Meshes. ACM TOG, 2005.
[27] Nasir Mohammad Khalid, Tianhao Xie, Eugene Belilovsky, and Tiberiu Popa. CLIP-Mesh: Generating Textured Meshes from Text Using Pretrained Image-Text Models. In SIGGRAPH ASIA, 2022.
[28] Kunho Kim, Mikaela Angelina Uy, Despoina Paschalidou, Alec Jacobson, Leonidas J. Guibas, and Minhyuk Sung. OptCtrlPoints: Finding the Optimal Control Points for Biharmonic 3D Shape Deformation. Computer Graphics Forum, 2023.
[29] Subin Kim, Kyungmin Lee, June Suk Choi, Jongheon Jeong, Kihyuk Sohn, and Jinwoo Shin. Collaborative Score Distillation for Consistent Visual Synthesis. In NeurIPS, 2023.
[30] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
[31] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár, and Ross Girshick. Segment Anything. In ICCV, 2023.
[32] Juil Koo, Seungwoo Yoo, Minh Hieu Nguyen, and Minhyuk Sung. SALAD: Part-Level Latent Diffusion for 3D Shape Generation and Manipulation. In ICCV, 2023.
[33] Samuli Laine, Janne Hellsten, Tero Karras, Yeongho Seol, Jaakko Lehtinen, and Timo Aila. Modular Primitives for High-Performance Differentiable Rendering. ACM TOG, 2020.
[34] Yuseung Lee, Kunho Kim, Hyunjin Kim, and Minhyuk Sung. SyncDiffusion: Coherent Montage via Synchronized Joint Diffusions. In NeurIPS, 2023.
[35] Yaron Lipman, David Levin, and Daniel Cohen-Or. Green Coordinates. ACM TOG, 2008.
[36] Yaron Lipman, Olga Sorkine, Daniel Cohen-Or, David Levin, Christian Rössl, and Hans-Peter Seidel. Differential Coordinates for Interactive Mesh Editing. In Proceedings of Shape Modeling International, pages 181–190, 2004.
[37] Yaron Lipman, Olga Sorkine, David Levin, and Daniel Cohen-Or. Linear Rotation-Invariant Coordinates for Meshes. ACM TOG, 2005.
[38] Minghua Liu, Minhyuk Sung, Radomir Mech, and Hao Su. DeepMetaHandles: Learning Deformation Meta-Handles of 3D Meshes with Biharmonic Coordinates. In CVPR, 2021.
[39] Yuan Liu, Cheng Lin, Zijiao Zeng, Xiaoxiao Long, Lingjie Liu, Taku Komura, and Wenping Wang. SyncDreamer: Learning to Generate Multiview-consistent Images from a Single-view Image. arXiv, 2023.
[40] Oscar Michel, Roi Bar-On, Richard Liu, Sagie Benaim, and Rana Hanocka. Text2Mesh: Text-Driven Neural Stylization for Meshes. In CVPR, 2022.
[41] Xingang Pan, Ayush Tewari, Thomas Leimkühler, Lingjie Liu, Abhimitra Meka, and Christian Theobalt. Drag Your GAN: Interactive Point-based Manipulation on the Generative Image Manifold. ACM TOG, 2023.
[42] Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis. arXiv, 2023.
[43] Ben Poole, Ajay Jain, Jonathan T Barron, and Ben Mildenhall. DreamFusion: Text-to-3D using 2D Diffusion. In ICLR, 2023.
[44] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning Transferable Visual Models from Natural Language Supervision. In ICML, 2021.
[45] Amit Raj, Srinivas Kaza, Ben Poole, Michael Niemeyer, Ben Mildenhall, Nataniel Ruiz, Shiran Zada, Kfir Aberman, Michael Rubenstein, Jonathan Barron, Yuanzhen Li, and Varun Jampani. DreamBooth3D: Subject-Driven Text-to-3D Generation. In ICCV, 2023.
[46] Daniel Ritchie. Rudimentary framework for running two-alternative forced choice (2afc) perceptual studies on mechanical turk.
[47] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-Resolution Image Synthesis with Latent Diffusion Models. In CVPR, 2022.
[48] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. DreamBooth: Fine Tuning Text-to-image Diffusion Models for Subject-Driven Generation. In CVPR, 2023.
[49] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, Patrick Schramowski, Srivatsa Kundurthy, Katherine Crowson, Ludwig Schmidt, Robert Kaczmarczyk, and Jenia Jitsev. LAION-5B: An open large-scale dataset for training next generation image-text models. In NeurIPS, 2022.
[50] Tianchang Shen, Jun Gao, Kangxue Yin, Ming-Yu Liu, and Sanja Fidler. Deep Marching Tetrahedra: a Hybrid Representation for High-Resolution 3D Shape Synthesis. In NeurIPS, 2021.
[51] Yichun Shi, Peng Wang, Jianglong Ye, Long Mai, Kejie Li, and Xiao Yang. MVDream: Multi-view Diffusion for 3D Generation. arXiv, 2023.
[52] Yujun Shi, Chuhui Xue, Jiachun Pan, Wenqing Zhang, Vincent YF Tan, and Song Bai. DragDiffusion: Harnessing Diffusion Models for Interactive Point-based Image Editing. arXiv, 2023.
[53] J. Ryan Shue, Eric Ryan Chan, Ryan Po, Zachary Ankner, Jiajun Wu, and Gordon Wetzstein. 3D Neural Field Generation using Triplane Diffusion. In CVPR, 2023.
[54] Olga Sorkine and Marc Alexa. As-Rigid-As-Possible Surface Modeling. Proceedings of EUROGRAPHICS/ACM SIGGRAPH Symposium on Geometry Processing, pages 109–116, 2007.
[55] Olga Sorkine, Daniel Cohen-Or, Yaron Lipman, Marc Alexa, Christian Rössl, and Hans-Peter Seidel. Laplacian Surface Editing. Proceedings of the EUROGRAPHICS/ACM SIGGRAPH Symposium on Geometry Processing, pages 179–188, 2004.
[56] Jiapeng Tang, Lev Markhasin, Bi Wang, Justus Thies, and Matthias Nießner. Neural Shape Deformation Priors. In NeurIPS, 2022.
[57] Jiaxiang Tang, Jiawei Ren, Hang Zhou, Ziwei Liu, and Gang Zeng. DreamGaussian: Generative Gaussian Splatting for Efficient 3D Content Creation. arXiv, 2023.
[58] Narek Tumanyan, Michal Geyer, Shai Bagon, and Tali Dekel. Plug-and-Play Diffusion Features for Text-Driven Image-to-Image Translation. In CVPR, 2023.
[59] Yu Wang, Alec Jacobson, Jernej Barbič, and Ladislav Kavan. Linear Subspace Design for Real-Time Shape Deformation. ACM TOG, 2015.
[60] Zhengyi Wang, Cheng Lu, Yikai Wang, Fan Bao, Chongxuan Li, Hang Su, and Jun Zhu. Prolificdreamer: High-fidelity and diverse text-to-3d generation with variational score distillation. In NeurIPS, 2023.
[61] Ofir Weber, Mirela Ben-Chen, and Craig Gotsman. Complex barycentric coordinates with applications to planar shape deformation. Computer Graphics Forum, 2009.
[62] Ofir Weber, Olga Sorkine, Yaron Lipman, and Craig Gotsman. Context-Aware Skeletal Shape Deformation. Computer Graphics Forum, 2007.
[63] Tong Wu, Jiarui Zhang, Xiao Fu, Yuxin Wang, Jiawei Ren, Liang Pan, Wayne Wu, Lei Yang, Jiaqi Wang, Chen Qian, Dahua Lin, and Ziwei Liu. OmniObject3D: Large-Vocabulary 3D Object Dataset for Realistic Perception, Reconstruction and Generation. In CVPR, 2023.
[64] Zhan Xu, Yang Zhou, Evangelos Kalogerakis, Chris Landreth, and Karan Singh. RigNet: Neural Rigging for Articulated Characters. ACM TOG, 2020.
[65] Zhan Xu, Yang Zhou, Li Yi, and Evangelos Kalogerakis. Morig: Motion-Aware Rigging of Character Meshes from Point Clouds. In SIGGRAPH ASIA, 2022.
[66] Wang Yifan, Noam Aigerman, Vladimir G. Kim, Siddhartha Chaudhuri, and Olga Sorkine. Neural Cages for Detail-Preserving 3D Deformations. In CVPR, 2020.
[67] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding Conditional Control to Text-to-Image Diffusion Models. In ICCV, 2023.
[68] Yuechen Zhang, Jinbo Xing, Eric Lo, and Jiaya Jia. Real-World Image Variation by Aligning Diffusion Inversion Chain. In NeurIPS, 2023.
[69] Jingyu Zhuang, Chen Wang, Lingjie Liu, Liang Lin, and Guanbin Li. DreamEditor: Text-Driven 3D Scene Editing with Neural Fields. ACM TOG, 2023.

Appendix

In this supplementary material, we first describe implementation details of our main pipeline (Sec. A.1) and details of APAP-Bench construction (Sec. A.2). We also provide the exact question given to user study participants, as well as an example questionnaire (Sec. A.3). Furthermore, we summarize additional experimental results including an ablation study for 2D mesh editing based on qualitative results (Sec. A.7), additional qualitative results for 3D shape deformation (Sec. A.4), more complex 3D shape deformations achieved by leveraging classical deformation techniques (Sec. A.5), and human evaluation on the plausibility of 3D deformations via user study (Sec. A.6).

A.1 Implementation Details

We provide additional implementation details of Alg. 1. We used a modified version of the differentiable Poisson solver from [2], denoted by $g$ in Alg. 1, and nvdiffrast [33] when implementing the differentiable renderer $\mathcal{R}$ in our pipeline. We render 2D/3D meshes at a resolution of $512\times 512$.

When editing 2D meshes, we optimize $\mathcal{L}_{h}$ for $M=300$ iterations in the FirstStage and jointly optimize $\mathcal{L}_{h}$ and $\mathcal{L}_{\text{SDS}}$ for $N=700$ iterations in the SecondStage. For experiments involving the optimization of 3D meshes with increased geometric complexity, we use $M=300$ and $N=1000$ for the two stages, respectively. We use Adam [30] with a learning rate $\gamma=1\times 10^{-3}$ throughout the optimization. Following DreamFusion [43], we use a Classifier-Free Guidance (CFG) scale of 100.0 and randomly sample $t\in[0.02,0.98]$ when evaluating $\mathcal{L}_{\text{SDS}}$.
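The two-stage schedule and the sampling settings above can be illustrated with a small sketch. It only mirrors the reported hyperparameters for the 2D setting; the helper names (`active_losses`, `sample_timestep`, `cfg_noise`), the uniform sampling of $t$, and the use of plain Python lists in place of tensors are assumptions made for illustration, and the guidance combination shown is the standard classifier-free guidance formula rather than code from our implementation.

```python
import random

M, N = 300, 700          # FirstStage / SecondStage iterations (2D setting)
CFG_SCALE = 100.0        # classifier-free guidance scale
T_MIN, T_MAX = 0.02, 0.98

def active_losses(step, first_stage_iters=M):
    """Names of the loss terms optimized at a given global step."""
    if step < first_stage_iters:
        return ("L_h",)              # FirstStage: handle loss only
    return ("L_h", "L_SDS")          # SecondStage: joint optimization

def sample_timestep(rng, t_min=T_MIN, t_max=T_MAX):
    """Draw a diffusion timestep; uniform sampling is an assumption."""
    return rng.uniform(t_min, t_max)

def cfg_noise(eps_uncond, eps_cond, scale=CFG_SCALE):
    """Standard CFG: extrapolate from the unconditional prediction."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]

rng = random.Random(0)
schedule = [active_losses(s) for s in range(M + N)]
timesteps = [sample_timestep(rng) for _ in range(N)]
print(schedule[M - 1], schedule[M])  # ('L_h',) ('L_h', 'L_SDS')
```

For the 3D setting, the same sketch applies with `N = 1000`.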

We use a script from diffusers [10] to finetune Stable Diffusion [47] with LoRA [17]. We employ stabilityai/stable-diffusion-2-1-base as our base model and augment its cross-attention layers in the U-Net with rank decomposition matrices of rank $16$. For 2D mesh editing, we train the injected parameters for $60$ iterations, using a rendering of the mesh as the training image. For 3D shape deformation, where renderings from 4 canonical viewpoints (front, back, left, and right) are available, we finetune the model for $200$ iterations. In both cases, we use the learning rate $\gamma=5\times 10^{-4}$.
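To make the role of the rank decomposition matrices concrete, the following sketch shows a LoRA-augmented linear map on toy data. It is not the diffusers training script: the function names, the tiny matrices, and the layer width used in the parameter count are hypothetical, and only the structure $Wx + BAx$ with a frozen $W$ and a zero-initialized $B$ follows LoRA [17].

```python
def lora_extra_params(d_in, d_out, rank=16):
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix."""
    return rank * (d_in + d_out)

def matvec(W, x):
    """Dense matrix-vector product on nested lists."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def lora_forward(W, A, B, x, scale=1.0):
    """y = W x + scale * B (A x); W stays frozen, only A and B are trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + scale * u for b, u in zip(base, update)]

W = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]  # frozen 2 x 3 weight (toy values)
A = [[0.5, -0.5, 1.0]]                   # down-projection, rank 1 here
B = [[0.0], [0.0]]                       # up-projection, zero-initialized
x = [1.0, 2.0, 3.0]
y = lora_forward(W, A, B, x)
print(y)  # [7.0, -1.0], identical to W x before any training step

# Hypothetical 1024 x 1024 layer: rank 16 adds 32768 parameters,
# compared with 1048576 in the frozen weight itself.
print(lora_extra_params(1024, 1024))  # 32768
```

Because $B$ starts at zero, the finetuned model initially reproduces the base model, and the short training runs only move it slightly toward the rendered mesh.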

A.2 Details of APAP-Bench

Image Generation.

For evaluation purposes, we build APAP-Bench 2D by generating 2 images of real-world objects for each of the 20 categories using Stable Diffusion-XL [42] as noted in the main paper. We segment the foreground objects from the generated images and run Delaunay triangulation to populate a collection of 2D meshes. When generating the images, we use the following template prompt "a photo of [category name] in a white background" for all categories to facilitate foreground object segmentation. Tab. A3 summarizes the list of categories. Note that the list includes both human-made and organic objects that can be easily found in the daily environment to test the generalization capability of a deformation technique to various object types.

Human-Made	Organic
backpack	flying bird
bike	side view of cat
chair	side view of dog
high-heeled shoes	runway model
purse	sitting bird
side view of car	standing cheetah
sneakers	standing dragon
table	standing raccoon
airplane	standing sheep
	standing white duck
	starfish

Table A3: Object categories of 2D meshes in APAP-Bench 2D. APAP-Bench 2D includes 2D triangle meshes depicting various objects, including both human-made and organic objects.
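As an illustration of how the template prompt expands over the categories in Tab. A3, consider the sketch below. The variable and function names are hypothetical, and issuing the same prompt twice per category (for example with different random seeds) is an assumption about how the 2 images per category are obtained.

```python
HUMAN_MADE = [
    "backpack", "bike", "chair", "high-heeled shoes", "purse",
    "side view of car", "sneakers", "table", "airplane",
]
ORGANIC = [
    "flying bird", "side view of cat", "side view of dog", "runway model",
    "sitting bird", "standing cheetah", "standing dragon",
    "standing raccoon", "standing sheep", "standing white duck", "starfish",
]
TEMPLATE = "a photo of {category} in a white background"
IMAGES_PER_CATEGORY = 2

def build_prompts(categories, per_category=IMAGES_PER_CATEGORY):
    """One prompt per requested image, grouped by category."""
    return [
        TEMPLATE.format(category=c)
        for c in categories
        for _ in range(per_category)
    ]

prompts = build_prompts(HUMAN_MADE + ORGANIC)
print(len(prompts))  # 40
print(prompts[0])    # a photo of backpack in a white background
```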

Handle and Anchor Assignment.

We manually assign two handle and anchor pairs to each mesh to imitate user instructions. Specifically, we choose vertices on the shape boundaries instead of internal vertices to induce deformations that alter object silhouettes. For instance, users would try to drag the bottom of a backpack downward to enlarge the shape, instead of dragging an interior point which may flip triangles, distorting the appearance. As an anchor, we use the vertex closest to the center of mass of each mesh.

In experiments using APAP-Bench 3D and APAP-Bench 2D, we note that using the neighboring vertices of the given handles and anchors during deformation helps retain smooth geometry near the handle. Therefore, we additionally sample vertices near the handles and anchors that lie within a sphere of radius $r=0.01$ and refer to the extended sets of handles and anchors as region handles and region anchors, respectively. We use region anchors and a single handle for 3D experiments, and region anchors and region handles for 2D cases. Note that we use the same sets of handles and anchors when deforming shapes with our baselines for fair comparisons.
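The anchor choice and the region extension described above can be sketched as follows. This is an illustrative approximation: the function names and toy vertices are hypothetical, the center of mass is approximated by the mean of the vertex positions, and the radius is applied in whatever coordinate units the mesh uses.

```python
import math

def centroid(vertices):
    """Mean vertex position, used here as a stand-in for the center of mass."""
    n, dim = len(vertices), len(vertices[0])
    return tuple(sum(v[k] for v in vertices) / n for k in range(dim))

def closest_vertex(vertices, point):
    """Index of the vertex nearest to `point`."""
    return min(range(len(vertices)), key=lambda i: math.dist(vertices[i], point))

def region(vertices, seed_index, radius=0.01):
    """Indices of all vertices within `radius` of the seed vertex (inclusive)."""
    seed = vertices[seed_index]
    return [i for i, v in enumerate(vertices) if math.dist(v, seed) <= radius]

# Toy 2D mesh vertices (not from APAP-Bench).
vertices = [
    (0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0),
    (0.5, 0.5), (0.505, 0.5), (1.0, 0.995),
]
anchor = closest_vertex(vertices, centroid(vertices))
region_anchors = region(vertices, anchor)
region_handles = region(vertices, 2)   # handle placed at the corner (1, 1)
print(anchor, region_anchors, region_handles)  # 5 [4, 5] [2, 6]
```

The same functions work unchanged for 3D vertices, since `math.dist` accepts points of any matching dimension.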

A.3 Details of User Study for 2D Mesh Editing

In Sec. 4 of the main paper, we reported the preference statistics collected from 102 user study participants who passed the vigilance tests. We provide additional details of the user study in the following. We instructed participants to select the most anticipated outcome when the displayed source image is edited by the dragging operation visualized as an arrow with the question: "A visual designer wants to modify the object by clicking on a red point and dragging it in the direction of the arrow. Please choose a result that best satisfies the designer’s edit, while retaining the characteristics and plausibility of the object."

Fig. A6 (left) shows an example of a questionnaire provided to the participants. For vigilance tests, we included an editing result from DragDiffusion [52] depicting an object irrelevant to the source image in each question. The participants were asked to answer the same question. We illustrate an example questionnaire of a vigilance test in Fig. A6 (right).

A.4 Additional Qualitative Results for 3D Shape Deformation

Fig. A7 presents additional results of 3D shape deformation. As reported in the main paper, ARAP [54] only enforces local rigidity and hence cannot produce the smooth deformations intended by users. In the ninth row, ARAP [54] introduces a pointy end given an editing instruction that drags the bottom of a doll downward. Ours, however, elongates the entire geometry smoothly, producing a more visually plausible deformation. The example in the tenth row shows similar behaviors of ARAP [54] and ours. Here, unlike ARAP [54], the proposed method adjusts the overall proportion of the statue as the handle located at the tail is translated, while preserving the smooth and round geometry near the handle.

A.5 Complex 3D Deformation Examples

In addition to optimizing Jacobian fields using diffusion priors, which is made possible by the linearity of Poisson solvers, we can directly propagate local transforms, additionally defined at handle vertices, to the Jacobians of neighboring faces by employing geodesic distances as weights [9]. This allows for the more dramatic deformations illustrated in Fig. A8, involving limb articulations, large bending, and the use of multiple handles and anchors. As shown in the Panda example (the seventh, eighth, and ninth columns), our framework can handle large pose variations, which is useful in downstream applications such as animation.

Source (View 1)	Ours (View 1)	Ours (View 2)	Source (View 1)	Ours (View 1)	Ours (View 2)	Source (View 1)	Ours (View 1)	Ours (View 2)	Source (View 1)	Ours (View 1)	Ours (View 2)

Figure A8: Examples of deforming source meshes using multiple handles and anchors. Best viewed zoomed in.

A.6 User Study Results for 3D Shape Deformation

Assessing the visual plausibility of 3D deformations is particularly challenging due to the difficulty of populating large-scale reference sets as we did for 2D meshes in Sec. 4. We further note that, unlike for 3D generative models, computing image-based metrics such as the CLIP-R score is non-trivial since it is hard to describe handle-based deformations solely using text prompts.

Therefore, we conduct a user study similar to the one presented in Sec. 4. We asked 47 user study participants on Amazon Mechanical Turk (MTurk) to compare rendered images of meshes deformed using ARAP [54] and ours. Each participant is provided with 20 image pairs and asked to select one image from each pair given the question: "Which edited image is more realistic and plausible? Choose one of the following images." An example of a questionnaire displayed to the participants is shown in Fig. A9 (left). We provide an example of the vigilance tests, similar to those in the user study for 2D mesh editing, in Fig. A9 (right). As summarized in Tab. A4, the deformations produced by our method are preferred over the results from the baseline.

Methods	Preference (%) $\uparrow$
ARAP [54]	41.7
Ours	58.3

Table A4: User study preference for 3D mesh deformation. In a user study with participants recruited on Amazon Mechanical Turk (MTurk), the results produced by our method were preferred over the outputs of the baseline.

A.7 Ablation Study for 2D Mesh Editing

In this section, we provide qualitative results from the ablation study to validate the impact of each component on the plausibility of editing results. In Tab. A5, we summarize the results obtained by (1) optimizing only $\mathcal{L}_{h}$, (2) optimizing $\mathcal{L}_{h}$ and $\mathcal{L}_{\text{SDS}}$ without LoRA finetuning, (3) skipping the FirstStage, (4) using ARAP initialization, (5) using Poisson initialization, and (6) Ours. As mentioned in the main paper, optimizing only $\mathcal{L}_{h}$ (the second column) either distorts texture (the fifth row) or inflates or shrinks other parts of the given shape (the seventh and twelfth rows). This demonstrates the necessity of a visual prior during deformation. We also observe cases where skipping the FirstStage (the fourth column) does not lead to the intended deformation, as our diffusion prior is reluctant to modify shapes from their original states (the first, second, and fifth rows). On the other hand, deformations initialized with the meshes produced by ARAP [54] (the fifth column) or the Poisson solve (the sixth column) suffer from distortions that could not be resolved by optimizing $\mathcal{L}_{\text{SDS}}$ in the SecondStage.

Table A5: Ablation study for 2D mesh editing. We examine the impact of each design choice on deformation outputs, including the use of diffusion prior (the second column), LoRA finetuning (the third column), two-stage pipeline (the fourth column), and initialization strategies during the FirstStage (the fifth and sixth column).



Source	$\mathcal{L}_{h}$ Only	No LoRA [17]	SecondStage Only	ARAP Init.	Poisson Init.	Ours

View 1			View 2			View 2 (Zoom In)
Source	ARAP [54]	Ours	Source	ARAP [54]	Ours	Source	ARAP [54]	Ours
