As-Plausible-As-Possible: Plausibility-Aware Mesh Deformation
Using 2D Diffusion Priors

Seungwoo Yoo*1  Kunho Kim*1  Vladimir G. Kim2  Minhyuk Sung1
1KAIST     2Adobe Research
Abstract

We present As-Plausible-As-Possible (APAP), a mesh deformation technique that leverages 2D diffusion priors to preserve the plausibility of a mesh under user-controlled deformation. Our framework uses per-face Jacobians to represent mesh deformations, where mesh vertex coordinates are computed via a differentiable Poisson solve. The deformed mesh is rendered, and the resulting 2D image is used in the Score Distillation Sampling (SDS) process, which enables extracting meaningful plausibility priors from a pretrained 2D diffusion model. To better preserve the identity of the edited mesh, we fine-tune our 2D diffusion model with LoRA. Gradients extracted by SDS and a user-prescribed handle displacement are then backpropagated to the per-face Jacobians, and we use iterative gradient descent to compute the final deformation that balances the user edit against the output plausibility. We evaluate our method with 2D and 3D meshes and demonstrate qualitative and quantitative improvements when using plausibility priors over the geometry-preservation or distortion-minimization priors used by previous techniques. Our project page is at: https://as-plausible-as-possible.github.io/

Figure 1: APAP, our novel shape deformation method, enables plausibility-aware mesh deformation while preserving the fine details of the original mesh. It offers an interface that alters geometry by directly displacing a handle (red) along a direction (gray) while fixing an anchor vertex (green). Using a diffusion prior results in smoother geometry around the armchair handle, as seen in the example (middle column).
* Equal contribution.

1 Introduction

Meshes are the most prevalent representation for 2D and 3D content and are widely adopted in industry, thanks to their efficient storage, simple rendering, compatibility with common graphics pipelines, versatility across applications such as design, physical simulation, and 3D printing, and flexibility in decoupling geometry from appearance.

For the creation of 2D and 3D meshes, recent breakthroughs in generative models [50, 53, 32, 43, 60, 51, 39, 57] have demonstrated significant advances. These breakthroughs enable users to easily generate content from a text prompt [43, 60, 51, 39, 57], or from photos [51, 45]. However, visual content creation typically involves numerous editing processes, deforming the content to satisfy users’ desires through interactions such as mouse clicks and drags. Facilitating such interactive editing has remained relatively underexplored in the context of recent generative techniques.

Mesh deformation is a subject that has been researched for decades in computer graphics. Over time, researchers have established well-defined methodologies, characterizing mesh deformation as an optimization problem that aims to preserve specific geometric properties, such as the Mesh Laplacian [36, 55, 37], local rigidity [18, 54], and mesh surface Jacobians [2, 13], while satisfying given constraints. To facilitate user interaction, these methodologies have been extended to introduce specific user-interactive deformation handles, such as keypoints [20, 59, 28], cage mesh [26, 25, 35, 61, 66, 23], and skeleton [4, 64, 65], with the blending functions defined based on the preservation of geometric properties.

Despite the widespread use of classical mesh deformation methods, they often fail to meet users’ needs because they do not incorporate the perceptual plausibility of the outputs. For example, as illustrated in Fig. 1, when a user intends to drag a point on the top of a table image, the classical deformation technique may introduce unnatural bending instead of lifting the tabletop. This limitation arises because deformation techniques solely based on geometric properties do not incorporate such semantic and perceptual priors, resulting in the mesh editing process becoming more tedious and time-consuming.

Recent learning-based mesh deformation techniques [66, 23, 28, 64, 38, 2, 56] have attempted to address this problem in a data-driven way. However, they are also limited by relying on the existence of certain variations in the training data. Even recent large-scale 3D datasets [6, 63, 8, 7] have not reached the scale that covers all possible visual content users might intend to create.

To this end, we introduce our novel mesh deformation framework, dubbed APAP (As-Plausible-As-Possible), which exploits 2D image priors from a diffusion model pretrained on an Internet-scale image dataset to enhance the plausibility of deformed 2D and 3D meshes while preserving the geometric priors of the given shape. Recently, score distillation sampling (SDS) [43] has demonstrated great success in generating plausible 2D and 3D content, such as NeRF [69, 29, 24] and vector images [22, 19], using the distilled 2D image priors from a diffusion model. We incorporate these diffusion-model-based 2D priors into the optimization-based deformation framework, achieving the best synergy between geometry-based optimization and distilled-prior-based optimization.

To achieve this optimal synergy between geometric and perceptual priors within a unified framework, we introduce an alternating optimization approach. At each step, we first update the Jacobian of each mesh face using the SDS loss and user-provided constraints. Subsequently, the mesh vertex positions are recalculated by solving Poisson’s equation with the updated face Jacobians. The direct application of the 2D diffusion prior via SDS, however, tends to compromise the identity of the given objects, an essential aspect in deformation. We thus enhance the identity awareness of the diffusion prior by finetuning it with the provided source image. The model is integrated into our two-stage pipeline, which initiates deformation without the perceptual prior (SDS) and then refines it with SDS and the given constraints to create deformations that adhere to user-defined editing instructions while remaining visually plausible.

In experiments, we examine APAP using APAP-Bench, which consists of 3D and 2D triangular meshes and editing instructions. The proposed method produces more plausible deformations of 3D meshes than its baseline [54], which is based exclusively on a geometric prior. Evaluation on the task of 2D mesh editing further verifies the effectiveness of APAP, as illustrated by the highest $k$-NN GIQA score [14] in quantitative analysis and the higher preference over the baseline in a user study.

2 Related Work

2.1 Geometric Mesh Deformation

Mesh deformation has been one of the central problems in geometry processing and is thus addressed by a wide range of techniques. Cage-based methods [26, 25, 35, 61] let users alter meshes by manipulating cages enclosing them, calculating a point inside as a weighted sum of cage vertices. Skeleton-based approaches [62, 4, 64, 65] offer animation control by mapping surface points to underlying joints and bones, ideal for animating human/animal-like figures. Unlike the previous techniques that require manual cage or skeleton construction, biharmonic coordinates-based methods [20, 59] automate establishing mappings from control points to vertices by formulating optimization problems. Other works instead allow users to manipulate shapes via direct vertex displacement while imposing constraints on local surface geometry, including rigidity [18, 54] and Laplacian smoothness [36, 55, 37]. Such hand-crafted deformation priors often lack consideration of visual plausibility, necessitating careful control point placement and iterative manual refinement to achieve satisfactory results.

2.2 Data-Driven Mesh Deformation

Data-driven approaches to mesh deformation [66, 23, 28, 64, 38, 2, 56] learn from shape collections, utilizing neural networks to infer parameters for classical deformation techniques, such as cage vertex coordinates and displacements [66], keypoints [23, 59, 28], subspaces of keypoint arrangements [38], differential coordinates [2], etc. However, these methods assume the availability of large-scale category-specific shape collection [66, 23, 59, 28, 64] or require dense correspondences between them [56, 2], limiting their applicability to new, out-of-sample shapes. We instead propose to directly mine deformation priors from pretrained diffusion models. Leveraging a generic (category-agnostic) image generative model trained on an Internet-scale image dataset, we devise a method that easily generalizes to novel 2D and 3D shapes while lifting the requirement for shape collections.

2.3 Pretrained 2D Priors for Shape Manipulation

Image analysis [44] and generation [47, 67, 3, 34] techniques can serve as effective visual priors for image editing tasks [5, 16, 58, 68, 52]. In addition, recent works [12, 48] and their adaptation [11] enable personalized image generation and editing by learning a text embedding [12] or fine-tuning additional parameters, such as LoRA [17], to preserve and replicate the identities of given exemplars during editing. One interesting work is DragDiffusion [52], akin to DragGAN [41], which introduces a drag-based user interface for image editing through the manipulation of latent representations. However, it does not extend to the deformation of parametric images, such as 2D meshes, or of 3D shapes. Another interesting line of work [40, 27, 13] extends the idea further to manipulate shapes by propagating image-based gradients to the underlying shape representations. These methods maximize CLIP [44] similarity between the renderings and text prompts to either add geometric textures [40], jointly update both vertices and texture [27], or deform a shape parameterized by per-triangle Jacobians [13]. In contrast to such text-driven editing techniques, we build on Score Distillation Sampling (SDS) [43] to enable direct manipulation of shapes via handle displacement, ensuring visual plausibility. While the technique is prevalent in various problems ranging from text-to-3D [43, 60, 51, 39, 57] to image editing [15] and neural field editing [69], it has not been adopted for shape deformation.

3 Method

We present APAP, a novel handle-based mesh deformation framework capable of producing visually plausible deformations of either 2D or 3D triangular meshes. To achieve this goal, we integrate powerful 2D diffusion priors into a learnable Jacobian field representation of shapes.

We emphasize that leveraging 2D priors, such as latent diffusion models (LDMs) [47] trained on large-scale datasets [49], for shape deformation poses challenges that require meticulous design choices. The following sections will delve into the details of shape representation (Sec. 3.1) and diffusion prior (Sec. 3.2), offering a rationale for the design decisions underpinning our framework (Sec. 3.3).

3.1 Representing Shapes as Jacobian Fields

Let $\mathcal{M}_0 = (\mathbf{V}_0, \mathbf{F}_0)$ denote a source mesh to be deformed, represented by vertices $\mathbf{V}_0 \in \mathbb{R}^{V \times 3}$ and faces $\mathbf{F}_0 \in \mathbb{R}^{F \times 3}$. Users are allowed to select a set of vertices used as movable handles designated by an indicator matrix $\mathbf{K}_h \in \{0,1\}^{V_h \times V}$. We also require users to select a set of anchors, represented as another indicator matrix $\mathbf{K}_a \in \{0,1\}^{V_a \times V}$, to avoid trivial solutions (i.e., global translations).
Then, the handle and anchor vertices become $\mathbf{V}_h = \mathbf{K}_h \mathbf{V}_0$ and $\mathbf{V}_a = \mathbf{K}_a \mathbf{V}_0$.

Our framework also expects a set of vectors $\mathbf{D}_h \in \mathbb{R}^{V_h \times 3}$ that indicate the directions along which the handles will be displaced. Furthermore, we let $\mathbf{T}_h = \mathbf{V}_h + \mathbf{D}_h$ and $\mathbf{T}_a = \mathbf{V}_a$ denote the target positions of the user-specified handles and anchors, respectively.
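The handle and anchor bookkeeping above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the helper `indicator_matrix`, the four-vertex toy mesh, and the chosen vertex indices are all assumptions made for the example.

```python
import numpy as np

def indicator_matrix(indices, num_vertices):
    """Build a {0,1} selection matrix with one row per selected vertex."""
    K = np.zeros((len(indices), num_vertices))
    K[np.arange(len(indices)), indices] = 1.0
    return K

# Toy source mesh (hypothetical): the four corners of a unit square, V = 4.
V0 = np.array([[0.0, 0.0, 0.0],
               [1.0, 0.0, 0.0],
               [1.0, 1.0, 0.0],
               [0.0, 1.0, 0.0]])

K_h = indicator_matrix([2], len(V0))     # one handle: vertex 2
K_a = indicator_matrix([0, 1], len(V0))  # two anchors: vertices 0 and 1

D_h = np.array([[0.0, 0.5, 0.0]])        # drag the handle by 0.5 along +y

V_h = K_h @ V0   # handle positions, shape (V_h, 3)
V_a = K_a @ V0   # anchor positions, shape (V_a, 3)
T_h = V_h + D_h  # handle targets
T_a = V_a        # anchors are asked to stay in place
```

Multiplying by an indicator matrix simply selects rows of the vertex array, which keeps the handle and anchor terms linear in the vertex positions.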

In this work, we employ a Jacobian field $\mathbf{J}_0 = \{\mathbf{J}_{0,f} \mid f \in \mathbf{F}_0\}$, a dual representation of $\mathcal{M}_0$, defined as a set of per-face Jacobians $\mathbf{J}_{0,f} \in \mathbb{R}^{3 \times 3}$ where

$$\mathbf{J}_{0,f} = \boldsymbol{\nabla}_f \mathbf{V}_0, \tag{1}$$

and $\boldsymbol{\nabla}_f$ is the gradient operator of triangle $f$.
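As a rough illustration of Eqn. 1, the sketch below builds the gradient operator of a single triangle from the gradients of its piecewise-linear hat functions and applies it to the triangle's vertex positions. The function names and the example triangle are assumptions for this sketch. The convention used here, where rows index the spatial derivative direction and columns index the output coordinate, follows from reading $\boldsymbol{\nabla}_f \mathbf{V}_0$ as a matrix product and may differ from the authors' implementation.

```python
import numpy as np

def face_gradient(p0, p1, p2):
    """3x3 gradient operator of one triangle; column i is the gradient of hat function i."""
    normal = np.cross(p1 - p0, p2 - p0)
    double_area = np.linalg.norm(normal)
    n = normal / double_area
    # grad(phi_i) = n x (edge opposite to vertex i) / (2 * area)
    return np.stack([np.cross(n, p2 - p1),
                     np.cross(n, p0 - p2),
                     np.cross(n, p1 - p0)], axis=1) / double_area

def face_jacobian(rest, deformed):
    """J_f = grad_f @ V_f, where the rows of V_f are the triangle's deformed vertices."""
    return face_gradient(*rest) @ deformed

rest = np.array([[0.0, 0.0, 0.0],
                 [1.0, 0.0, 0.0],
                 [0.0, 1.0, 0.0]])

# Undeformed triangle: J projects onto the triangle's tangent plane.
J_identity = face_jacobian(rest, rest)

# Shear x' = x + y: the derivative of x' along y shows up in row 1, column 0.
sheared = rest.copy()
sheared[:, 0] += rest[:, 1]
J_shear = face_jacobian(rest, sheared)
```

For an undeformed triangle the Jacobian is the projector onto the tangent plane rather than the full identity, because a triangle carries no information about deformation along its normal.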

Conversely, we compute a set of deformed vertices $\mathbf{V}^*$ from a given Jacobian field $\mathbf{J}$ by solving Poisson’s equation

$$\mathbf{V}^* = \operatorname*{arg\,min}_{\mathbf{V}} \|\mathbf{L}\mathbf{V} - \boldsymbol{\nabla}^T \mathcal{A} \mathbf{J}\|^2, \tag{2}$$

where $\boldsymbol{\nabla}$ is a stack of per-face gradient operators, $\mathcal{A} \in \mathbb{R}^{3F \times 3F}$ is the mass matrix, and $\mathbf{L} \in \mathbb{R}^{V \times V}$ is the cotangent Laplacian of $\mathcal{M}_0$. Since $\mathbf{L}$ is rank-deficient, the solution of Eqn. 2 cannot be uniquely determined unless we impose constraints. We thus consider a constrained optimization problem

$$\mathbf{V}^* = \operatorname*{arg\,min}_{\mathbf{V}} \|\mathbf{L}\mathbf{V} - \boldsymbol{\nabla}^T \mathcal{A} \mathbf{J}\|^2 + \lambda \|\mathbf{K}_a \mathbf{V} - \mathbf{T}_a\|^2, \tag{3}$$

where $\lambda \in \mathbb{R}^+$ is a weight for the constraint term. Note that we solve Eqn. 3 with the user-specified anchors as constraints to determine $\mathbf{V}^*$.

Taking the derivative with respect to $\mathbf{V}$ and setting it to zero, the problem in Eqn. 3 turns into a system of equations

$$\left(\mathbf{L}^T \mathbf{L} + \lambda \mathbf{K}_a^T \mathbf{K}_a\right) \mathbf{V} = \mathbf{L}^T \boldsymbol{\nabla}^T \mathcal{A} \mathbf{J} + \lambda \mathbf{K}_a^T \mathbf{T}_a, \tag{4}$$

which can be efficiently solved using a differentiable solver [2] implementing Cholesky decomposition.
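To make Eqn. 4 concrete, the following sketch assembles dense versions of $\boldsymbol{\nabla}$, $\mathcal{A}$, and $\mathbf{L}$ for a small planar grid and solves the normal equations. It is an illustration under stated assumptions: the helpers `grid_mesh`, `operators`, and `poisson_solve` are hypothetical names, $\mathbf{L}$ is formed as $\boldsymbol{\nabla}^T \mathcal{A} \boldsymbol{\nabla}$ with face areas on the diagonal of $\mathcal{A}$, and a dense `np.linalg.solve` stands in for the differentiable Cholesky-based solver of [2].

```python
import numpy as np

def grid_mesh(n):
    """Planar (n+1) x (n+1) vertex grid on z = 0, two triangles per cell."""
    xs, ys = np.meshgrid(np.linspace(0, 1, n + 1), np.linspace(0, 1, n + 1))
    V = np.stack([xs.ravel(), ys.ravel(), np.zeros(xs.size)], axis=1)
    F = []
    for i in range(n):
        for j in range(n):
            a = i * (n + 1) + j
            b, c, d = a + 1, a + n + 1, a + n + 2
            F += [[a, b, d], [a, d, c]]
    return V, np.array(F)

def operators(V, F):
    """Stacked gradient (3F x V), mass matrix (3F x 3F), and Laplacian (V x V)."""
    grad, area = np.zeros((3 * len(F), len(V))), np.zeros(len(F))
    for f, (i, j, k) in enumerate(F):
        p0, p1, p2 = V[i], V[j], V[k]
        normal = np.cross(p1 - p0, p2 - p0)
        area[f] = 0.5 * np.linalg.norm(normal)
        n = normal / np.linalg.norm(normal)
        # Hat-function gradients: n x (opposite edge) / (2 * area).
        grad[3 * f:3 * f + 3, i] = np.cross(n, p2 - p1) / (2 * area[f])
        grad[3 * f:3 * f + 3, j] = np.cross(n, p0 - p2) / (2 * area[f])
        grad[3 * f:3 * f + 3, k] = np.cross(n, p1 - p0) / (2 * area[f])
    mass = np.diag(np.repeat(area, 3))
    return grad, mass, grad.T @ mass @ grad

def poisson_solve(J, grad, mass, lap, K_a, T_a, lam=1e3):
    """Solve the normal equations of Eqn. 4 for the vertex positions."""
    lhs = lap.T @ lap + lam * K_a.T @ K_a
    rhs = lap.T @ grad.T @ mass @ J + lam * K_a.T @ T_a
    return np.linalg.solve(lhs, rhs)

V0, F0 = grid_mesh(2)
grad, mass, lap = operators(V0, F0)
J0 = grad @ V0                 # stacked per-face Jacobians, shape (3F, 3)

K_a = np.zeros((1, len(V0)))
K_a[0, 0] = 1.0                # anchor vertex 0, which sits at the origin
T_a = K_a @ V0

V_same = poisson_solve(J0, grad, mass, lap, K_a, T_a)       # reproduces V0
V_big = poisson_solve(2.0 * J0, grad, mass, lap, K_a, T_a)  # scales the mesh about the anchor
```

Feeding the source Jacobians back into the solver recovers the source vertices, while doubling every Jacobian doubles the mesh about the anchored vertex. This shows how the anchor term removes the translational ambiguity of the Laplacian.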

We let $g$ denote a functional representing the aforementioned differentiable solver for notational convenience, $\mathbf{V}^* = g(\mathbf{J}, \mathbf{K}_a, \mathbf{T}_a)$. Since $g$ is differentiable, we can deform $\mathcal{M}_0$ by propagating upstream gradients from various loss functions to the underlying parameterization $\mathbf{J}$. For instance, one may impose a soft constraint on the locations of selected handles during optimization with the objective of the form:

$$\mathcal{L}_h = \|\mathbf{K}_h \mathbf{V}^* - \mathbf{T}_h\|^2. \tag{5}$$

We will discuss how such a soft constraint can be blended into our framework in Sec. 3.3. Next, we describe how to incorporate a pretrained diffusion model as a prior for visual plausibility.
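The soft handle constraint of Eqn. 5 and its gradient with respect to the solved vertices can be written out directly. The sketch below assumes the norm is the Frobenius norm over all handle coordinates; the function names and the toy inputs are illustrative only.

```python
import numpy as np

def handle_loss(V_star, K_h, T_h):
    """Eqn. 5: squared distance between current handle positions and their targets."""
    residual = K_h @ V_star - T_h
    return np.sum(residual ** 2)

def handle_loss_grad(V_star, K_h, T_h):
    """Gradient of Eqn. 5 with respect to V*; only handle rows are nonzero."""
    return 2.0 * K_h.T @ (K_h @ V_star - T_h)

# Hypothetical current solution V* (a unit square) with vertex 2 as the handle.
V_star = np.array([[0.0, 0.0, 0.0],
                   [1.0, 0.0, 0.0],
                   [1.0, 1.0, 0.0],
                   [0.0, 1.0, 0.0]])
K_h = np.zeros((1, 4))
K_h[0, 2] = 1.0
T_h = np.array([[1.0, 1.5, 0.0]])

loss = handle_loss(V_star, K_h, T_h)
grad_V = handle_loss_grad(V_star, K_h, T_h)
```

The gradient is nonzero only on the handle rows. In the full framework it would be propagated further through the differentiable solver $g$ to reach the Jacobians.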

Figure 2: The overview of APAP. APAP parameterizes a triangular mesh as a per-face Jacobian field that can be updated via gradient descent. Given a textured mesh and user inputs specifying the handle(s) and anchor(s), our framework initializes a Jacobian field as a trainable parameter. During the first stage, the Jacobian field is updated via iterative optimization of $\mathcal{L}_h$, a soft constraint that initially deforms the shape according to the user’s instruction. In the following stage, the mesh is rendered using a differentiable renderer $\mathcal{R}$ and the rendered image is provided as an input to a diffusion prior finetuned with LoRA [17] that computes the SDS loss $\mathcal{L}_{\text{SDS}}$. The joint optimization of $\mathcal{L}_h$ and $\mathcal{L}_{\text{SDS}}$ improves the visual plausibility of the mesh while conforming to the given edit instruction.

3.2 Score Distillation for Shape Deformation

While traditional mesh deformation techniques produce variations that satisfy the given geometric constraints, they do not account for visual plausibility and can therefore yield unrealistic shapes. Motivated by recent success in text-to-3D literature, we harness a powerful 2D diffusion prior [47] in our framework as a critic that directs deformation by scoring the realism of the current shape.

Specifically, we distill its prior knowledge via Score Distillation Sampling (SDS) [43]. Let $\mathbf{J}$ denote the current Jacobian field and $\mathbf{V}^*$ be the set of vertices computed from $\mathbf{J}$ following the procedure described in Sec. 3.1.

We render $\mathcal{M}^* = (\mathbf{V}^*, \mathbf{F})$ from a viewpoint defined by camera extrinsic parameters $\mathbf{C}$ using a differentiable renderer $\mathcal{R}$, producing an image $\mathcal{I} = \mathcal{R}(\mathcal{M}^*, \mathbf{C})$. The diffusion prior $\hat{\epsilon}_\phi$ then rates the realism of $\mathcal{I}$, producing a gradient

$$\nabla_{\mathbf{J}} \mathcal{L}_{\text{SDS}}(\phi, \mathcal{I}) = \mathbb{E}_{t,\epsilon}\left[ w(t) \left( \hat{\epsilon}_\phi(\mathbf{z}_t; y, t) - \epsilon \right) \frac{\partial \mathcal{I}}{\partial \mathbf{J}} \right], \tag{6}$$

where $t \sim \mathcal{U}(0,1)$, $\epsilon \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$, and $\mathbf{z}_t$ is a noisy latent embedding of $\mathcal{I}$. The propagated gradient alters the geometry of $\mathcal{M}$ by modifying $\mathbf{J}$.
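The behavior of the SDS gradient in Eqn. 6 can be seen in a deliberately simplified setting. In the sketch below, the pretrained diffusion model is replaced by a toy noise predictor that is exact when the whole data distribution is a single point `mu`, the timestep is fixed rather than sampled, and the latent `z` is optimized directly. In the actual method the gradient would instead be chained through $\partial \mathcal{I} / \partial \mathbf{J}$, that is, through the renderer and the Poisson solve. All names here (`sds_gradient`, `toy_eps_hat`, `alpha_bar`, `mu`) are assumptions for illustration.

```python
import numpy as np

def sds_gradient(z, eps_hat_fn, alpha_bar, weight, rng, num_samples=8):
    """Monte Carlo estimate of E_eps[ w(t) * (eps_hat(z_t; t) - eps) ] at one fixed t."""
    total = np.zeros_like(z)
    for _ in range(num_samples):
        eps = rng.standard_normal(z.shape)
        # Forward noising of the clean latent at noise level alpha_bar.
        z_t = np.sqrt(alpha_bar) * z + np.sqrt(1.0 - alpha_bar) * eps
        total += weight * (eps_hat_fn(z_t, alpha_bar) - eps)
    return total / num_samples

# Stand-in "diffusion prior": the exact noise predictor when all data equals mu.
mu = np.array([1.0, -2.0, 0.5])

def toy_eps_hat(z_t, alpha_bar):
    return (z_t - np.sqrt(alpha_bar) * mu) / np.sqrt(1.0 - alpha_bar)

rng = np.random.default_rng(0)
z = np.zeros(3)  # plays the role of the encoded rendering
alpha_bar, step = 0.5, 0.1
for _ in range(200):
    z = z - step * sds_gradient(z, toy_eps_hat, alpha_bar, weight=1.0, rng=rng)
```

With this toy predictor the residual $\hat{\epsilon} - \epsilon$ is proportional to `z - mu`, so descending along the SDS gradient pulls the latent toward the region the prior considers likely. A real diffusion model plays the same role with a far richer notion of what is plausible.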

To increase the instance-awareness of the diffusion model, we follow recent work [48, 52] on personalized image editing and finetune the model using LoRA [17]. In particular, we first render $\mathcal{M}$ from $n$ different viewpoints to obtain a set $\mathcal{I} = \{\mathcal{I}_1, \dots, \mathcal{I}_n\}$ of training images and inject additional parameters into the model, resulting in an expanded set of network parameters $\phi'$. The parameters are then optimized with a denoising loss [47]

$$\mathcal{L} = \mathbb{E}_{t,\epsilon,\mathbf{z}}\left[ \|\hat{\epsilon}_{\phi'}(\mathbf{z}_t; y, t) - \epsilon\|^2 \right], \tag{7}$$

where $\mathbf{z}_t$ denotes a latent of a training image perturbed with noise at timestep $t$.
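For readers unfamiliar with LoRA, the "additional parameters" can be pictured as a low-rank update added to a frozen weight matrix. The sketch below shows the general LoRA formulation on a single linear layer in NumPy. It is not the authors' training code, and this section does not state which layers of the diffusion model are adapted or which rank is used, so `LoRALinear`, the layer sizes, and the rank are assumptions.

```python
import numpy as np

class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (scale * B @ A)."""

    def __init__(self, weight, rank, alpha=1.0, seed=0):
        rng = np.random.default_rng(seed)
        out_dim, in_dim = weight.shape
        self.weight = weight                                 # frozen, pretrained
        self.A = rng.standard_normal((rank, in_dim)) * 0.01  # trainable
        self.B = np.zeros((out_dim, rank))                   # trainable, zero-initialized
        self.scale = alpha / rank

    def __call__(self, x):
        return x @ (self.weight + self.scale * self.B @ self.A).T

W = np.random.default_rng(1).standard_normal((64, 32))
layer = LoRALinear(W, rank=4)
x = np.ones((2, 32))

out_before = layer(x)  # identical to the frozen layer while B is still zero
extra_params = layer.A.size + layer.B.size
```

Because `B` starts at zero, the adapted layer initially reproduces the pretrained one, and only the small matrices `A` and `B` would receive gradients from the denoising loss in Eqn. 7.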

The finetuned diffusion prior, together with a learnable Jacobian field representation of the source mesh $\mathcal{M}_0$, comprises the proposed framework described in the following section.

3.3 As-Plausible-As-Possible (APAP)

Algorithm 1 As-Plausible-As-Possible
Parameters: $g$, $\mathcal{R}$, $\phi$, $\gamma$, $M$, $N$
Inputs: $\mathcal{M}_0 = (\mathbf{V}_0, \mathbf{F}_0)$, $\mathbf{K}_a$, $\mathbf{K}_h$, $\mathbf{T}_a$, $\mathbf{T}_h$, $\{\mathbf{C}_i\}_{i=1}^{n}$
Output: $\mathcal{M}$
procedure FirstStage($\mathbf{J}$, $\mathbf{K}_a$, $\mathbf{K}_h$, $\mathbf{T}_a$, $\mathbf{T}_h$, $g$)
     for $i = 1, 2, \dots, M$ do
         $\mathbf{V}^* \leftarrow g(\mathbf{J}, \mathbf{K}_a, \mathbf{T}_a)$ ▷ Solving Eqn. 4
         $\mathbf{J} \leftarrow \mathbf{J} - \gamma \nabla_{\mathbf{J}} \mathcal{L}_h(\mathbf{V}^*, \mathbf{K}_h, \mathbf{T}_h)$
     end for
     return $\mathbf{J}$
end procedure
procedure SecondStage($\mathbf{J}$, $\mathbf{F}_0$, $\mathbf{K}_a$, $\mathbf{K}_h$, $\mathbf{T}_a$, $\mathbf{T}_h$, $g$, $\phi$, $\{\mathbf{C}_i\}$)
     for $i = 1, 2, \dots, N$ do
         $\mathbf{V}^* \leftarrow g(\mathbf{J}, \mathbf{K}_a, \mathbf{T}_a)$ ▷ Solving Eqn. 4
         $\mathcal{M}^* \leftarrow (\mathbf{V}^*, \mathbf{F}_0)$
         $\mathbf{C} \sim \mathcal{U}(\{\mathbf{C}_i\})$ ▷ Viewpoint Sampling
         $\mathcal{I} \leftarrow \mathcal{R}(\mathcal{M}^*, \mathbf{C})$ ▷ Rendering
         $\mathbf{J} \leftarrow \mathbf{J} - \gamma \nabla_{\mathbf{J}} \left( \mathcal{L}_{\text{SDS}}(\phi, \mathcal{I}) + \mathcal{L}_h(\mathbf{V}^*, \mathbf{K}_h, \mathbf{T}_h) \right)$
     end for
     return $\mathbf{J}$
end procedure
$\phi \leftarrow$ LoRA($\phi$, $\mathcal{M}_0$, $\mathcal{R}$, $\{\mathbf{C}_i\}$)
$\mathbf{J} \leftarrow \{\mathbf{J}_{0,f} \mid f \in \mathbf{F}_0\}$
$\mathbf{J} \leftarrow$ FirstStage($\mathbf{J}$, $\mathbf{K}_a$, $\mathbf{K}_h$, $\mathbf{T}_a$, $\mathbf{T}_h$, $g$)
$\mathbf{J} \leftarrow$ SecondStage($\mathbf{J}$, $\mathbf{F}_0$, $\mathbf{K}_a$, $\mathbf{K}_h$, $\mathbf{T}_a$, $\mathbf{T}_h$, $g$, $\phi$, $\{\mathbf{C}_i\}$)
$\mathbf{V} \leftarrow g(\mathbf{J}, \mathbf{K}_a, \mathbf{T}_a)$
$\mathcal{M} \leftarrow (\mathbf{V}, \mathbf{F}_0)$
return $\mathcal{M}$

APAP tackles the problem of plausibility-aware shape deformation by harmonizing the best of both worlds: a learnable shape representation founded on classical geometry processing, robust to noisy gradients, and a powerful 2D diffusion prior finetuned with the image(s) of the source mesh for better instance-awareness.

We provide an overview of the proposed pipeline in Fig. 2 and the algorithm in Alg. 1, and describe the details below. Given a textured mesh $\mathcal{M}_{0}$, handles $\mathbf{K}_{h}$, anchors $\mathbf{K}_{a}$, and their target positions $\mathbf{T}_{h}$ and $\mathbf{T}_{a}$ as inputs, APAP yields a plausible deformation $\mathcal{M}$ of $\mathcal{M}_{0}$ that conforms to the given handle-target constraints. Before deforming $\mathcal{M}_{0}$, we render it from a single view in the case of 2D meshes and from four canonical views (i.e., front, back, left, and right) for 3D meshes, and use the images to finetune Stable Diffusion [47] by optimizing LoRA [17] parameters injected into the model (the red line in Fig. 2). Simultaneously, APAP computes the Jacobian field $\mathbf{J}_{0}$ of the input mesh $\mathcal{M}_{0}$ and uses it to initialize the trainable parameter $\mathbf{J}$.

APAP deforms the input mesh through two stages. In the FirstStage, it first deforms the input mesh according to instructions from users without taking visual plausibility into account. The subsequent SecondStage integrates a 2D diffusion prior into the optimization loop, simultaneously enforcing user constraints and visual plausibility.

At every iteration of the FirstStage, illustrated as the blue box in Fig. 2, we compute the vertex positions $\mathbf{V}^{*}$ corresponding to the current Jacobian field $\mathbf{J}$ by solving Eqn. 3 using the anchors specified by $\mathbf{K}_{a}$ as hard constraints. Then, we compute the soft constraint $\mathcal{L}_{h}$, defined in Eqn. 5, that drags a set of handle vertices $\mathbf{K}_{h}\mathbf{V}^{*}$ toward the corresponding targets $\mathbf{T}_{h}$. The interleaving of the differentiable Poisson solve and the optimization of $\mathcal{L}_{h}$ via gradient descent is repeated for $M$ iterations. This progressively updates $\mathbf{J}$, treated as a learnable black box in our framework, deforming $\mathcal{M}_{0}$. Consequently, the edited mesh $\mathcal{M}^{*} = \left(\mathbf{J}, \mathbf{F}_{0}\right)$ follows the user constraints at the cost of degraded plausibility, which is mitigated in the following stage through the incorporation of a diffusion prior.
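To make the interleaving of the Poisson solve and gradient descent concrete, the following is a minimal NumPy sketch of a FirstStage-style loop on a toy 2D polyline rather than a triangle mesh. Everything in it is an illustrative assumption and not our implementation: per-edge difference vectors stand in for per-face Jacobians, `poisson_solve` is a dense least-squares stand-in for the differentiable solver $g$, the handle loss is taken to be a squared distance, and plain gradient descent with a hand-picked step size replaces the optimizer.

```python
import numpy as np

def build_gradient_operator(n_vertices):
    """Edge-difference operator G of a polyline: (G @ V)[i] = V[i+1] - V[i]."""
    G = np.zeros((n_vertices - 1, n_vertices))
    idx = np.arange(n_vertices - 1)
    G[idx, idx] = -1.0
    G[idx, idx + 1] = 1.0
    return G

def make_solver(G, anchor_ids):
    """Precompute the linear map of the anchored least-squares solve."""
    free_ids = np.setdiff1d(np.arange(G.shape[1]), anchor_ids)
    G_free, G_anchor = G[:, free_ids], G[:, anchor_ids]
    P = np.linalg.pinv(G_free)  # V_free = P @ (J - G_anchor @ T_a)
    return free_ids, G_anchor, P

def poisson_solve(J, T_a, anchor_ids, free_ids, G_anchor, P):
    """Vertices whose edge vectors best match J, with anchors held fixed."""
    V = np.zeros((len(free_ids) + len(anchor_ids), J.shape[1]))
    V[anchor_ids] = T_a                     # hard constraints
    V[free_ids] = P @ (J - G_anchor @ T_a)  # least-squares fit to the field J
    return V

def first_stage(V0, anchor_ids, T_a, handle_ids, T_h, steps=300, lr=0.05):
    """Gradient descent on J using only the handle loss (toy version)."""
    G = build_gradient_operator(len(V0))
    free_ids, G_anchor, P = make_solver(G, anchor_ids)
    J = G @ V0  # initialize the field from the source shape
    # Rows of P that produce the handle vertices (handles must not be anchors).
    handle_rows = np.searchsorted(free_ids, handle_ids)
    for _ in range(steps):
        V = poisson_solve(J, T_a, anchor_ids, free_ids, G_anchor, P)
        residual = V[handle_ids] - T_h               # handle-to-target offsets
        grad_J = 2.0 * P[handle_rows].T @ residual   # chain rule through the solve
        J = J - lr * grad_J
    return J, poisson_solve(J, T_a, anchor_ids, free_ids, G_anchor, P)
```

In this toy setting, anchoring the first vertex of a straight polyline and dragging the last one spreads the displacement evenly over all edges, because every edge vector receives the same gradient. The point of the sketch is only that the handle constraint is satisfied by updating the underlying field $\mathbf{J}$ rather than by moving vertices directly.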

The result of the FirstStage then serves as the initialization for the SecondStage, illustrated as the green box in Fig. 2, which is guided by the plausibility constraint $\mathcal{L}_{\text{SDS}}$. Unlike the FirstStage, where the update of $\mathbf{J}$ was driven purely by the geometric constraint $\mathcal{L}_{h}$, we aim to steer the optimization based on the visual plausibility of the current mesh $\mathcal{M}^{*}$. To achieve this, we render $\mathcal{M}^{*}$ with a differentiable renderer $\mathcal{R}$ from the same viewpoint(s) from which the training image(s) for finetuning were rendered. When deforming 3D meshes, we randomly sample one viewpoint at each iteration. The rendered image $\mathcal{I}$ is used to evaluate $\mathcal{L}_{\text{SDS}}$, which is optimized jointly with $\mathcal{L}_{h}$ for $N$ iterations. The combination of geometric and plausibility constraints improves the visual plausibility of the output while encouraging it to conform to the given constraints.

We note that the iterative approach in the FirstStage leads to better results than alternative update strategies, such as deforming the source mesh $\mathcal{M}_{0}$ by minimizing the ARAP energy [54] or solving Eqn. 3 using both $\mathbf{K}_{h}$ and $\mathbf{K}_{a}$ as hard constraints. In our experiments (Sec. 4), we show that both alternatives produce distortions that cannot be corrected by the diffusion prior in the subsequent stage. Specifically, directly solving Eqn. 3 using all available constraints only yields the least-squares solution $\mathbf{V}^{*}$ without updating the underlying Jacobians $\mathbf{J}$, resulting in the aforementioned distortions.

4 Experiments

We evaluate APAP in downstream applications involving manipulation of 3D and 2D meshes.

4.1 Experiment Setup

Benchmark.

To evaluate the plausibility of a mesh deformation, we propose a novel benchmark, APAP-Bench, of textured 3D and 2D triangular meshes spanning both human-made and organic objects, annotated with handle vertices, their editing directions, and anchor vertices. The set of 3D meshes, APAP-Bench 3D, is constructed using meshes from ShapeNet [6] and Genie [1]. The meshes are normalized to fit in a unit cube. Each mesh is manually annotated with editing instructions, including a set of anchors, handles, and corresponding targets to simulate editing scenarios. APAP-Bench offers another subset called APAP-Bench 2D, a collection of 80 textured, planar meshes of various objects, to facilitate the quantitative analysis and user study described later in this section. To create APAP-Bench 2D, we first generate 2 images of real-world objects for each of the 20 categories using Stable Diffusion-XL [42]. We then extract foreground masks from the generated images using SAM [31] and sample pixels that lie on the boundary and in the interior. The sampled pixels are used for Delaunay triangulation, constrained with the edges along the main contour of the masks, which produces 2D triangular meshes with texture. We assign two handle and anchor pairs to each mesh that imitate user instructions. For evaluation purposes, we populate the reference set by sampling 1,000 images for each object category using Stable Diffusion-XL. The generated images are used to evaluate a perceptual metric that assesses the plausibility of 2D mesh editing results, as described in Sec. 4.3.

[Figure 3 image. Panels: View 1, View 2, and View 2 (Zoom In); each panel shows Source, ARAP [54], and Ours.]
Figure 3: Qualitative results from 3D shape deformation. We visualize the source shapes and their deformations made using ARAP [54] and ours by following the instructions each of which specifies a handle (red), an edit direction denoted with an arrow (gray), and an anchor (green). We showcase the rendered images captured from two different viewpoints, as well as one zoom-in view highlighting local details.

Baselines.

We compare our method (APAP) against As-Rigid-As-Possible (ARAP) [54], one of the most widely used mesh deformation techniques that permit shape manipulation via direct vertex displacement. Throughout the experiments, we use the implementation in libigl [21] with default parameters.

Evaluation Metrics.

In 2D experiments, we conduct quantitative analysis based on the $k$-NN GIQA score [14] as an evaluation metric to assess the plausibility of instance-specific editing results. The metric quantifies the perceptual proximity between the edited image and its $k$ nearest neighbors in the reference set included in APAP-Bench 2D. As our objective is to make plausible variations of 2D meshes via deformation, an edited object should remain perceptually similar to other objects in the same category. We use $k=12$ throughout the experiments.
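A nearest-neighbor score of this kind can be sketched as follows. This is a minimal illustration under stated assumptions, not the reference implementation of [14]: we assume the score is the mean inverse squared Euclidean distance between the feature of the edited image and the features of its $k$ nearest reference images, and that image features have already been extracted by some backbone (the feature extractor is omitted). The function name `knn_giqa_score` and the `eps` stabilizer are hypothetical.

```python
import numpy as np

def knn_giqa_score(query_feat, reference_feats, k=12, eps=1e-12):
    """Mean inverse squared distance from a query feature to its k nearest references.

    query_feat:      (d,) feature of the edited image (assumed precomputed).
    reference_feats: (n, d) features of the reference set (assumed precomputed).
    """
    sq_dists = np.sum((reference_feats - query_feat) ** 2, axis=1)
    nearest = np.sort(sq_dists)[:k]  # the k smallest squared distances
    return float(np.mean(1.0 / (nearest + eps)))
```

Under this assumed form, a higher score means the edited image lies closer to the reference images of its category in feature space, which matches the "higher is better" direction reported in our tables.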

4.2 3D Shape Deformation

Qualitative Results.

We showcase examples of 3D shape deformation where each deformation is specified by a handle (red), an edit direction (gray), and an anchor (green). As shown in Fig. 3, APAP manipulates 3D shapes in a way that improves visual plausibility, which is not achievable by relying solely on a geometric prior such as ARAP [54]. For instance, given a user input that drags a handle on one blade of an axe (the first row) along an arrow, APAP simultaneously expands both blades of the axe, whereas ARAP [54] produces distortions near the head. Similar examples that demonstrate the symmetry-awareness of APAP can be found in other cases, such as a car (the second row) and an owl (the sixth row), where a user lifts only one side of the shape upward and APAP recovers the symmetry, which ARAP [54] cannot achieve. Also, note that APAP produces a smooth articulation at the leg of the wolf (the fourth row) by adjusting the overall posture, whereas ARAP creates excessive bending.

Figure 4: Failure cases of DragDiffusion. DragDiffusion [52] can easily compromise the identity of edited instances because it manipulates their latents without an explicit parameterization.

4.3 2D Mesh Editing

Figure 5: Qualitative results from 2D mesh deformation. 2D meshes are edited using ARAP [54] and the proposed method following the edit instruction consisting of a handle (red), a target direction (gray), and an anchor (green). We showcase the rendered images of the edited meshes, as well as a zoom-in view highlighting local details.

Qualitative Evaluation.

We present qualitative results using the baselines and our method in Fig. 5. Each row shows two different results obtained by editing an image based on a handle moved from the original position (red) along a direction indicated by an arrow (gray) while fixing an anchor (green), similar to the 3D experiments discussed in the previous section.

As shown in Fig. 5, ARAP [54] enforces local rigidity and often results in implausible deformations. For example, it does not account for the mechanics of the human body and introduces an unrealistic articulation of a human arm (the fourth row). In addition, it twists the body of a sports car (the fifth row). Both failures originate from a lack of understanding of the appearance of objects. APAP alleviates this issue by incorporating a visual prior into shape deformation, producing a bend near the elbow and preserving the smooth silhouette of the car, respectively.

While APAP is designed for meshes, not images, we provide an additional qualitative comparison against DragDiffusion [52], an image editing technique that operates in pixel space, to demonstrate the effectiveness of mesh-based parameterization in applications where identity preservation is crucial. As shown in Fig. 4, DragDiffusion [52] may corrupt the identity of the instances depicted in input images during the encoding and decoding procedure. APAP, on the other hand, makes plausible variations of the given objects while maintaining their identity, benefiting from the explicit mesh representation in which it is grounded.

| Methods | $k$-NN GIQA ($\times 10^{-2}$) $\uparrow$ |
| --- | --- |
| ARAP [54] | 4.753 |
| DragDiffusion [52] | 4.545 |
| Ours ($\mathcal{L}_{h}$ Only) | 4.797 |
| Ours (ARAP Init.) | 4.740 |
| Ours (Poisson Init.) | 4.316 |
| Ours | 4.887 |

Table 1: Quantitative analysis for 2D mesh editing. APAP outperforms its baselines in quantitative evaluation using $k$-NN GIQA [14].

| Methods | Preference (%) $\uparrow$ |
| --- | --- |
| ARAP [54] | 40.83 |
| Ours | 59.17 |

Table 2: User study preference for 2D image editing. In a user study targeting users on Amazon Mechanical Turk (MTurk), the results produced using ours were preferred over the outputs from the baseline.

Quantitative Evaluation.

Tab. 1 summarizes the $k$-NN GIQA scores measured on the outputs from ARAP [54] (the first row) and APAP (the sixth row) using APAP-Bench 2D. As shown, APAP demonstrates superior performance over ARAP [54]. This again verifies the observations from the qualitative evaluation, where ARAP [54] introduces distortions that harm visual plausibility. As in the qualitative evaluation, we also report the $k$-NN GIQA score of DragDiffusion [52], which is degraded due to artifacts caused during direct manipulation of latents.

User Study.

We further conduct a user study for a more precise perceptual analysis. We follow Ritchie [46] and recruit participants on Amazon Mechanical Turk (MTurk). Each participant is provided with a set of 20 randomly sampled images of the source meshes paired with editing results of ARAP [54] and APAP. To check whether the response from a participant is reliable, we present 5 vigilance tests and collect 102 responses from the participants who passed the vigilance tests.

We instructed participants to select the most anticipated outcome when the displayed source image is edited by the dragging operation visualized as an arrow. We provide details of the user study in the supplementary material. Tab. 2 shows a higher preference of the participants on our method over ARAP [54] implying that our method produces more visually plausible deformations.

Ablation Study.

Tab. 1 summarizes the impact of different initialization strategies in the first stage on the $k$-NN GIQA score. As reported in the third row of the table, optimizing only $\mathcal{L}_{h}$, which aims exclusively to satisfy geometric constraints, leads to unnatural distortions. We provide a qualitative comparison in the supplementary material.

While designing the algorithm illustrated in Alg. 1, we considered other options for the FirstStage. Instead of optimizing $\mathcal{L}_{h}$ to initially deform a shape, we used a shape produced by ARAP [54] or by solving a Poisson’s equation constrained not only on anchor positions but also on handles at their target positions, reached by following the given edit directions. We report the $k$-NN GIQA scores of these alternatives in the fourth and fifth rows of Tab. 1, respectively. Both initialization strategies degrade the plausibility of results due to large distortions introduced by either solely enforcing local rigidity or finding least-squares solutions without updating the Jacobians. This poses a challenge to the diffusion prior, which struggles to induce meaningful update directions when provided with renderings with noticeable distortions, as can be seen in the qualitative analysis in the supplementary material.

5 Conclusion

We presented APAP, a novel deformation framework that tackles the problem of plausibility-aware shape deformation while offering intuitive controls over a wide range of shapes represented as triangular meshes. To this end, we carefully orchestrate two core components, a learnable Jacobian-based parameterization that originates from geometry processing and powerful 2D priors acquired by text-to-image diffusion models trained on Internet-scale datasets. We assessed the performance of the proposed method against an existing geometric-prior-based deformation technique and also thoroughly investigated the significance of our design choices through experiments.

Acknowledgements

We appreciate RECON Labs for their support in our experiments. This work was supported by NRF grant (RS-2023-00209723) and IITP grants (2022-0-00594, RS-2023-00227592) funded by the Korean government (MSIT), Seoul R&BD Program (CY230112), and grants from the DRB-KAIST SketchTheFuture Research Center, Hyundai NGV, KT, NCSOFT, and Samsung Electronics.

References

  • [1] Luma AI. Genie.
  • [2] Noam Aigerman, Kunal Gupta, Vladimir G Kim, Siddhartha Chaudhuri, Jun Saito, and Thibault Groueix. Neural Jacobian Fields: Learning Intrinsic Mappings of Arbitrary Meshes. ACM TOG, 2022.
  • [3] Omer Bar-Tal, Lior Yariv, Yaron Lipman, and Tali Dekel. MultiDiffusion: Fusing Diffusion Paths for Controlled Image Generation. In ICML, 2023.
  • [4] Ilya Baran and Jovan Popović. Automatic Rigging and Animation of 3D Characters. ACM TOG, 2007.
  • [5] Mingdeng Cao, Xintao Wang, Zhongang Qi, Ying Shan, Xiaohu Qie, and Yinqiang Zheng. MasaCtrl: Tuning-Free Mutual Self-Attention Control for Consistent Image Synthesis and Editing. In ICCV, 2023.
  • [6] Angel X Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, et al. ShapeNet: An Information-Rich 3D Model Repository. arXiv preprint arXiv:1512.03012, 2015.
  • [7] Matt Deitke, Ruoshi Liu, Matthew Wallingford, Huong Ngo, Oscar Michel, Aditya Kusupati, Alan Fan, Christian Laforte, Vikram Voleti, Samir Yitzhak Gadre, Eli VanderBilt, Aniruddha Kembhavi, Carl Vondrick, Georgia Gkioxari, Kiana Ehsani, Ludwig Schmidt, and Ali Farhadi. Objaverse-XL: A Universe of 10M+ 3D Objects. In NeurIPS, 2023.
  • [8] Matt Deitke, Dustin Schwenk, Jordi Salvador, Luca Weihs, Oscar Michel, Eli VanderBilt, Ludwig Schmidt, Kiana Ehsani, Aniruddha Kembhavi, and Ali Farhadi. Objaverse: A Universe of Annotated 3D Objects. In CVPR, 2023.
  • [9] Yu et al. Mesh editing with poisson-based gradient field manipulation. ACM Trans. Graph., 23(3), 2004.
  • [10] Hugging Face. Diffusers: State-of-the-art diffusion models for image and audio generation in PyTorch.
  • [11] Hugging Face. DreamBooth fine-tuning with LoRA.
  • [12] Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H. Bermano, Gal Chechik, and Daniel Cohen-Or. An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion. In ICLR, 2023.
  • [13] William Gao, Noam Aigerman, Groueix Thibault, Vladimir Kim, and Rana Hanocka. TextDeformer: Geometry Manipulation using Text Guidance. ACM TOG, 2023.
  • [14] Shuyang Gu, Jianmin Bao, Dong Chen, and Fang Wen. GIQA: Generated Image Quality Assessment. In ECCV, 2020.
  • [15] Amir Hertz, Kfir Aberman, and Daniel Cohen-Or. Delta Denoising Score. In ICCV, 2023.
  • [16] Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to-Prompt Image Editing with Cross-Attention Control. In ICLR, 2023.
  • [17] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, and Weizhu Chen. LoRA: Low-Rank Adaptation of Large Language Models. In ICLR, 2022.
  • [18] Takeo Igarashi, Tomer Moscovich, and John F. Hughes. As-Rigid-as-Possible Shape Manipulation. ACM TOG, 2005.
  • [19] Shir Iluz, Yael Vinker, Amir Hertz, Daniel Berio, Daniel Cohen-Or, and Ariel Shamir. Word-As-Image for Semantic Typography. ACM TOG, 2023.
  • [20] Alec Jacobson, Ilya Baran, Jovan Popović, and Olga Sorkine. Bounded Biharmonic Weights for Real-Time Deformation. ACM TOG, 2011.
  • [21] Alec Jacobson, Daniele Panozzo, et al. libigl: A simple C++ geometry processing library, 2018.
  • [22] Ajay Jain, Amber Xie, and Pieter Abbeel. VectorFusion: Text-to-SVG by Abstracting Pixel-Based Diffusion Models. In CVPR, 2023.
  • [23] Tomas Jakab, Richard Tucker, Ameesh Makadia, Jiajun Wu, Noah Snavely, and Angjoo Kanazawa. KeypointDeformer: Unsupervised 3D Keypoint Discovery for Shape Control. In CVPR, 2020.
  • [24] Jangho Park, Gihyun Kwon, and Jong Chul Ye. ED-NeRF: Efficient Text-Guided Editing of 3D Scene using Latent Space NeRF. arXiv, 2023.
  • [25] Pushkar Joshi, Mark Meyer, Tony DeRose, Brian Green, and Tom Sanocki. Harmonic Coordinates for Character Articulation. ACM TOG, 2007.
  • [26] Tao Ju, Scott Schaefer, and Joe Warren. Mean Value Coordinates for Closed Triangular Meshes. ACM TOG, 2005.
  • [27] Nasir Mohammad Khalid, Tianhao Xie, Eugene Belilovsky, and Popa Tiberiu. CLIP-Mesh: Generating Textured Meshes from Text Using Pretrained Image-Text Models. SIGGRAPH ASIA, 2022.
  • [28] Kunho Kim, Mikaela Angelina Uy, Despoina Paschalidou, Alec Jacobson, Leonidas J. Guibas, and Minhyuk Sung. OptCtrlPoints: Finding the Optimal Control Points for Biharmonic 3D Shape Deformation. Computer Graphics Forum, 2023.
  • [29] Subin Kim, Kyungmin Lee, June Suk Choi, Jongheon Jeong, Kihyuk Sohn, and Jinwoo Shin. Collaborative Score Distillation for Consistent Visual Synthesis. In NeurIPS, 2023.
  • [30] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
  • [31] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár, and Ross Girshick. Segment Anything. In ICCV, 2023.
  • [32] Juil Koo, Seungwoo Yoo, Minh Hieu Nguyen, and Minhyuk Sung. SALAD: Part-Level Latent Diffusion for 3D Shape Generation and Manipulation. In ICCV, 2023.
  • [33] Samuli Laine, Janne Hellsten, Tero Karras, Yeongho Seol, Jaakko Lehtinen, and Timo Aila. Modular Primitives for High-Performance Differentiable Rendering. ACM TOG, 2020.
  • [34] Yuseung Lee, Kunho Kim, Hyunjin Kim, and Minhyuk Sung. SyncDiffusion: Coherent Montage via Synchronized Joint Diffusions. In NeurIPS, 2023.
  • [35] Yaron Lipman, David Levin, and Daniel Cohen-Or. Green Coordinates. ACM TOG, 2008.
  • [36] Yaron Lipman, Olga Sorkine, Daniel Cohen-Or, David Levin, Christian Rössl, and Hans-Peter Seidel. Differential Coordinates for Interactive Mesh Editing. In Proceedings of Shape Modeling International, pages 181–190, 2004.
  • [37] Yaron Lipman, Olga Sorkine, David Levin, and Daniel Cohen-Or. Linear Rotation-Invariant Coordinates for Meshes. ACM TOG, 2005.
  • [38] Minghua Liu, Minhyuk Sung, Radomir Mech, and Hao Su. DeepMetaHandles: Learning Deformation Meta-Handles of 3D Meshes with Biharmonic Coordinates. In CVPR, 2021.
  • [39] Yuan Liu, Cheng Lin, Zijiao Zeng, ** Wang. SyncDreamer: Learning to Generate Multiview-consistent Images from a Single-view Image. arXiv, 2023.
  • [40] Oscar Michel, Roi Bar-On, Richard Liu, Sagie Benaim, and Rana Hanocka. Text2Mesh: Text-Driven Neural Stylization for Meshes. In CVPR, 2022.
  • [41] Xingang Pan, Ayush Tewari, Thomas Leimkühler, Lingjie Liu, Abhimitra Meka, and Christian Theobalt. Drag Your GAN: Interactive Point-based Manipulation on the Generative Image Manifold. ACM TOG, 2023.
  • [42] Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis. arXiv, 2023.
  • [43] Ben Poole, Ajay Jain, Jonathan T Barron, and Ben Mildenhall. DreamFusion: Text-to-3D using 2D Diffusion. In ICLR, 2023.
  • [44] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning Transferable Visual Models from Natural Language Supervision. In ICML, 2021.
  • [45] Amit Raj, Srinivas Kaza, Ben Poole, Michael Niemeyer, Ben Mildenhall, Nataniel Ruiz, Shiran Zada, Kfir Aberman, Michael Rubenstein, Jonathan Barron, Yuanzhen Li, and Varun Jampani. DreamBooth3D: Subject-Driven Text-to-3D Generation. In ICCV, 2023.
  • [46] Daniel Ritchie. Rudimentary framework for running two-alternative forced choice (2afc) perceptual studies on mechanical turk.
  • [47] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-Resolution Image Synthesis with Latent Diffusion Models. In CVPR, 2022.
  • [48] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. DreamBooth: Fine Tuning Text-to-image Diffusion Models for Subject-Driven Generation. In CVPR, 2023.
  • [49] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, Patrick Schramowski, Srivatsa Kundurthy, Katherine Crowson, Ludwig Schmidt, Robert Kaczmarczyk, and Jenia Jitsev. LAION-5B: An open large-scale dataset for training next generation image-text models. In NeurIPS, 2022.
  • [50] Tianchang Shen, Jun Gao, Kangxue Yin, Ming-Yu Liu, and Sanja Fidler. Deep Marching Tetrahedra: a Hybrid Representation for High-Resolution 3D Shape Synthesis. In NeurIPS, 2021.
  • [51] Yichun Shi, Peng Wang, Jianglong Ye, Long Mai, Kejie Li, and Xiao Yang. MVDream: Multi-view Diffusion for 3D Generation. arXiv, 2023.
  • [52] Yujun Shi, Chuhui Xue, Jiachun Pan, Wenqing Zhang, Vincent YF Tan, and Song Bai. DragDiffusion: Harnessing Diffusion Models for Interactive Point-based Image Editing. arXiv, 2023.
  • [53] J. Ryan Shue, Eric Ryan Chan, Ryan Po, Zachary Ankner, Jiajun Wu, and Gordon Wetzstein. 3D Neural Field Generation using Triplane Diffusion. In CVPR, 2023.
  • [54] Olga Sorkine and Marc Alexa. As-Rigid-As-Possible Surface Modeling. Proceedings of EUROGRAPHICS/ACM SIGGRAPH Symposium on Geometry Processing, pages 109–116, 2007.
  • [55] Olga Sorkine, Daniel Cohen-Or, Yaron Lipman, Marc Alexa, Christian Rössl, and Hans-Peter Seidel. Laplacian Surface Editing. Proceedings of the EUROGRAPHICS/ACM SIGGRAPH Symposium on Geometry Processing, pages 179–188, 2004.
  • [56] Jiapeng Tang, Markhasin Lev, Wang Bi, Thies Justus, and Matthias Nießner. Neural Shape Deformation Priors. In NeurIPS, 2022.
  • [57] Jiaxiang Tang, Jiawei Ren, Hang Zhou, Ziwei Liu, and Gang Zeng. DreamGaussian: Generative Gaussian Splatting for Efficient 3D Content Creation. arXiv, 2023.
  • [58] Narek Tumanyan, Michal Geyer, Shai Bagon, and Tali Dekel. Plug-and-Play Diffusion Features for Text-Driven Image-to-Image Translation. In CVPR, 2023.
  • [59] Yu Wang, Alec Jacobson, Jernej Barbič, and Ladislav Kavan. Linear Subspace Design for Real-Time Shape Deformation. ACM TOG, 2015.
  • [60] Zhengyi Wang, Cheng Lu, Yikai Wang, Fan Bao, Chongxuan Li, Hang Su, and Jun Zhu. Prolificdreamer: High-fidelity and diverse text-to-3d generation with variational score distillation. In NeurIPS, 2023.
  • [61] Ofir Weber, Mirela Ben-Chen, and Craig Gotsman. Complex barycentric coordinates with applications to planar shape deformation. Computer Graphics Forum, 2009.
  • [62] Ofir Weber, Olga Sorkine, Yaron Lipman, and Craig Gotsman. Context-Aware Skeletal Shape Deformation. Computer Graphics Forum, 2007.
  • [63] Tong Wu, Jiarui Zhang, Xiao Fu, Yuxin Wang, Jiawei Ren, Liang Pan, Wayne Wu, Lei Yang, Jiaqi Wang, Chen Qian, Dahua Lin, and Ziwei Liu. OmniObject3D: Large-Vocabulary 3D Object Dataset for Realistic Perception, Reconstruction and Generation. In CVPR, 2023.
  • [64] Zhan Xu, Yang Zhou, Evangelos Kalogerakis, Chris Landreth, and Karan Singh. RigNet: Neural Rigging for Articulated Characters. ACM TOG, 2020.
  • [65] Zhan Xu, Yang Zhou, Li Yi, and Evangelos Kalogerakis. Morig: Motion-Aware Rigging of Character Meshes from Point Clouds. In SIGGRAPH ASIA, 2022.
  • [66] Wang Yifan, Noam Aigerman, Vladimir G. Kim, Chaudhuri Siddhartha, and Olga Sorkine. Neural Cages for Detail-Preserving 3D Deformations. In CVPR, 2020.
  • [67] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding Conditional Control to Text-to-Image Diffusion Models. In ICCV, 2023.
  • [68] Yuechen Zhang, Jinbo Xing, Eric Lo, and Jiaya Jia. Real-World Image Variation by Aligning Diffusion Inversion Chain. In NeurIPS, 2023.
  • [69] Jingyu Zhuang, Chen Wang, Lingjie Liu, Liang Lin, and Guanbin Li. DreamEditor: Text-Driven 3D Scene Editing with Neural Fields. ACM TOG, 2023.

Appendix

In this supplementary material, we first describe implementation details of our main pipeline (Sec. A.1) and details of APAP-Bench construction (Sec. A.2). We also provide the exact question given to user study participants, as well as an example questionnaire (Sec. A.3). Furthermore, we summarize additional experimental results including an ablation study for 2D mesh editing based on qualitative results (Sec. A.7), additional qualitative results for 3D shape deformation (Sec. A.4), more complex 3D shape deformations achieved by leveraging classical deformation techniques (Sec. A.5), and human evaluation on the plausibility of 3D deformations via user study (Sec. A.6).

A.1 Implementation Details

We provide additional implementation details of Alg. 1. We used a modified version of the differentiable Poisson solver from [2], denoted by $g$ in Alg. 1, and nvdiffrast [33] when implementing the differentiable renderer $\mathcal{R}$ in our pipeline. We render 2D/3D meshes at a resolution of $512 \times 512$.

When editing 2D meshes, we optimize $\mathcal{L}_{h}$ for $M=300$ iterations in the FirstStage and jointly optimize $\mathcal{L}_{h}$ and $\mathcal{L}_{\text{SDS}}$ for $N=700$ iterations in the SecondStage. For experiments involving the optimization of 3D meshes with increased geometric complexity, we use $M=300$ and $N=1000$ for each stage, respectively. We use ADAM [30] with a learning rate $\gamma = 1 \times 10^{-3}$ throughout the optimization. We use a Classifier-Free Guidance (CFG) scale of 100.0 and randomly sample $t \in [0.02, 0.98]$ when evaluating $\mathcal{L}_{\text{SDS}}$, following DreamFusion [43].
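The roles of the CFG scale and the timestep range can be illustrated with a short sketch. It is a simplified illustration and not our implementation: the noise predictions of the diffusion model are assumed to be given as arrays (the model itself, the noising of the rendered image, and backpropagation through the renderer are all omitted), the timestep weighting is assumed constant, and the function names are hypothetical.

```python
import numpy as np

def sample_timestep(rng, t_min=0.02, t_max=0.98):
    """Draw a normalized diffusion timestep uniformly from [t_min, t_max]."""
    return rng.uniform(t_min, t_max)

def cfg_noise(eps_uncond, eps_cond, scale=100.0):
    """Classifier-free guidance: extrapolate from the unconditional prediction."""
    return eps_uncond + scale * (eps_cond - eps_uncond)

def sds_gradient(eps_uncond, eps_cond, noise, weight=1.0, scale=100.0):
    """SDS-style gradient with respect to the noised input.

    The gradient is the weighted difference between the guided noise
    prediction and the noise that was actually injected. In a full pipeline
    it would be propagated further to the mesh parameters.
    """
    return weight * (cfg_noise(eps_uncond, eps_cond, scale) - noise)
```

With a guidance scale of 1 the guided prediction reduces to the conditional one, and the gradient vanishes whenever the model predicts exactly the injected noise, i.e., when the rendering already looks plausible to the prior at that noise level.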

We use a script from diffusers [10] to finetune Stable Diffusion [47] with LoRA [17]. We employ stabilityai/stable-diffusion-2-1-base as our base model and augment its cross-attention layers in the U-Net with rank decomposition matrices of rank 16. For the task of 2D mesh editing, we train the injected parameters for 60 iterations, utilizing a rendering of a mesh as a training image. In the 3D shape deformation, where renderings from 4 canonical viewpoints (front, back, left, and right) are available, we finetune the model for 200 iterations. In both cases, we use the learning rate $\gamma = 5 \times 10^{-4}$.
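For readers unfamiliar with rank decomposition matrices, the following NumPy sketch shows the general idea of a LoRA-augmented linear layer. It is a generic illustration and not the diffusers script we used: the class name `LoRALinear`, the `alpha` scaling, the initialization, and the layer dimensions are assumptions made for the example.

```python
import numpy as np

class LoRALinear:
    """Frozen linear layer plus a trainable low-rank update B @ A."""

    def __init__(self, weight, rank=16, alpha=16.0, seed=0):
        rng = np.random.default_rng(seed)
        out_dim, in_dim = weight.shape
        self.weight = weight                                   # frozen base weights
        self.A = rng.normal(scale=0.01, size=(rank, in_dim))   # trainable
        self.B = np.zeros((out_dim, rank))                     # trainable, zero init
        self.scale = alpha / rank

    def __call__(self, x):
        base = x @ self.weight.T
        update = self.scale * (x @ self.A.T) @ self.B.T
        return base + update

    def num_trainable(self):
        return self.A.size + self.B.size
```

Because `B` starts at zero, the augmented layer initially reproduces the base model exactly, and only `rank * (in_dim + out_dim)` parameters per layer are trained, which is what makes finetuning on one to four renderings for a few hundred iterations practical.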

A.2 Details of APAP-Bench

Image Generation.

For evaluation purposes, we build APAP-Bench 2D by generating 2 images of real-world objects for each of the 20 categories using Stable Diffusion-XL [42] as noted in the main paper. We segment the foreground objects from the generated images and run Delaunay triangulation to populate a collection of 2D meshes. When generating the images, we use the following template prompt "a photo of [category name] in a white background" for all categories to facilitate foreground object segmentation. Tab. A3 summarizes the list of categories. Note that the list includes both human-made and organic objects that can be easily found in the daily environment to test the generalization capability of a deformation technique to various object types.

| Human-Made | Organic |
| --- | --- |
| backpack | flying bird |
| bike | side view of cat |
| chair | side view of dog |
| high-heeled shoes | runway model |
| purse | sitting bird |
| side view of car | standing cheetah |
| sneakers | standing dragon |
| table | standing raccoon |
| airplane | standing sheep |
| | standing white duck |
| | starfish |
Table A3: Object categories of 2D meshes in APAP-Bench 2D. APAP-Bench 2D includes 2D triangle meshes depicting various objects, including both human-made and organic objects.

Handle and Anchor Assignment.

We manually assign two handle and anchor pairs to each mesh to imitate user instructions. Specifically, we choose vertices on the shape boundaries instead of internal vertices to induce deformations that alter object silhouettes. For instance, users would try to drag the bottom of a backpack downward to enlarge the shape, instead of dragging an interior point which may flip triangles, distorting the appearance. As an anchor, we use the vertex closest to the center of mass of each mesh.
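The anchor choice described above can be sketched as follows for a 2D triangle mesh. This is a minimal illustration under stated assumptions: the center of mass is taken to be the area-weighted centroid of the triangles, and the function names (`area_weighted_centroid`, `pick_anchor`) and the toy mesh are hypothetical.

```python
import math


def area_weighted_centroid(vertices, triangles):
    """Center of mass of a 2D triangle mesh, weighting each triangle by its area."""
    total_area = 0.0
    cx = cy = 0.0
    for i, j, k in triangles:
        (x0, y0), (x1, y1), (x2, y2) = vertices[i], vertices[j], vertices[k]
        area = 0.5 * abs((x1 - x0) * (y2 - y0) - (x2 - x0) * (y1 - y0))
        total_area += area
        cx += area * (x0 + x1 + x2) / 3.0
        cy += area * (y0 + y1 + y2) / 3.0
    return (cx / total_area, cy / total_area)


def pick_anchor(vertices, triangles):
    """Index of the vertex closest to the mesh's center of mass."""
    center = area_weighted_centroid(vertices, triangles)
    return min(range(len(vertices)), key=lambda v: math.dist(vertices[v], center))


# Toy mesh: a unit square split into four triangles around a middle vertex (index 4).
vertices = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0), (0.5, 0.5)]
triangles = [(0, 1, 4), (1, 2, 4), (2, 3, 4), (3, 0, 4)]
center = area_weighted_centroid(vertices, triangles)
anchor = pick_anchor(vertices, triangles)
```

For the toy square, the center of mass coincides with the middle vertex, so that vertex is selected as the anchor.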

In experiments on APAP-Bench 3D and APAP-Bench 2D, we observe that also using the vertices neighboring the given handles and anchors during deformation helps retain smooth geometry near the handle. Therefore, we additionally sample vertices that lie within a sphere of radius $r=0.01$ around the handles and anchors, and refer to the extended sets as region handles and region anchors, respectively. We use region anchors and a single handle for 3D experiments, and region anchors and region handles for 2D cases. Note that we use the same sets of handles and anchors when deforming shapes with our baselines for fair comparisons.
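The construction of region handles and region anchors can be sketched as a radius query around each selected vertex. This is an illustrative sketch only: the function name `region_vertices` and the toy coordinates are hypothetical, and it assumes Euclidean distance in the coordinate frame in which $r=0.01$ is measured.

```python
import math


def region_vertices(vertices, seed_indices, radius=0.01):
    """Indices of all vertices within `radius` of any seed vertex (seeds included)."""
    region = set(seed_indices)
    for s in seed_indices:
        for v, p in enumerate(vertices):
            if math.dist(p, vertices[s]) <= radius:
                region.add(v)
    return sorted(region)


# Toy vertex set: vertex 1 lies within r of vertex 0, the others do not.
vertices = [
    (0.0, 0.0, 0.0),
    (0.005, 0.0, 0.0),
    (0.0, 0.02, 0.0),
    (0.5, 0.5, 0.5),
]
region_handles = region_vertices(vertices, [0])   # handle at vertex 0
region_anchors = region_vertices(vertices, [3])   # anchor at vertex 3
```

The same function works for 2D meshes when the vertices are given as 2D tuples.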

A.3 Details of User Study for 2D Mesh Editing

In Sec. 4 of the main paper, we reported the preference statistics collected from 102 user study participants who passed the vigilance tests. We provide additional details of the user study in the following. We instructed participants to select the most anticipated outcome when the displayed source image is edited by the dragging operation visualized as an arrow with the question: "A visual designer wants to modify the object by clicking on a red point and dragging it in the direction of the arrow. Please choose a result that best satisfies the designer’s edit, while retaining the characteristics and plausibility of the object."

Fig. A6 (left) shows an example of a questionnaire provided to the participants. For vigilance tests, we included an editing result from DragDiffusion [52] depicting an object irrelevant to the source image in each question. The participants were asked to answer the same question. We illustrate an example questionnaire of a vigilance test in Fig. A6 (right).

Figure A6: Examples of questionnaires displayed during the user study (2D mesh editing). During the user study, we asked the participants to evaluate 20 different result pairs from ARAP [54] and ours as shown on the left. To check whether a participant is focusing on the user study, we included 5 items for the vigilance test. As shown on the right, a question for the vigilance test includes an image of an object that is not related to the source image.

A.4 Additional Qualitative Results for 3D Shape Deformation

Fig. A7 shows additional results of 3D shape deformation. As reported, ARAP [54] only enforces local rigidity and hence cannot produce the smooth deformations intended by users. In the ninth row, ARAP [54] introduces a pointy end given an editing instruction that drags the bottom of a doll downward. Ours, however, elongates the entire geometry smoothly, producing a more visually plausible deformation. The example in the tenth row shows similar behavior for ARAP [54] and ours. Here, unlike ARAP [54], the proposed method adjusts the overall proportion of the statue as the handle located at the tail is translated, while preserving the smooth and round geometry near the handle.

[Figure A7 panels: Source, ARAP [54], and Ours, each shown for View 1, View 2, and View 2 (Zoom In)]
Figure A7: Additional qualitative results from 3D shape deformation. We visualize the source shapes and their deformations made using ARAP [54] and ours by following the instructions each of which specifies a handle (red), an edit direction denoted with an arrow (gray), and an anchor (green). We showcase the rendered images captured from two different viewpoints, as well as one zoom-in view highlighting local details.

A.5 Complex 3D Deformation Examples

In addition to the ability to optimize Jacobian fields using diffusion priors offered by the linearity of Poisson solvers, we can directly propagate local transforms, additionally defined at handle vertices, to Jacobians of neighboring faces by employing geodesic distances as weights [9]. This allows for the more dramatic deformations illustrated in Fig. A8, involving limb articulations, large bending, and the use of multiple handles and anchors. As shown in the Panda example (the seventh, eighth, and ninth columns), our framework can handle large pose variations, which is useful in downstream applications such as animation.

[Figure A8 panels: four examples, each shown as Source (View 1), Ours (View 1), and Ours (View 2)]
Figure A8: Examples of deforming source meshes using multiple handles and anchors. Best viewed zoomed in.

A.6 User Study Results for 3D Shape Deformation

Assessing the visual plausibility of 3D deformations is particularly challenging due to the difficulty of populating large-scale reference sets as we did for 2D meshes in Sec. 4. We further note that, unlike in the evaluation of 3D generative models, computing image-based metrics such as CLIP-R score is non-trivial since it is hard to describe handle-based deformations solely using text prompts.

Figure A9: Examples of questionnaires displayed during the user study (3D shape deformation). During the user study, we asked the participants to evaluate 20 different result pairs from ARAP [54] and ours as shown on the left. To check whether a participant is focusing on the user study, we included 5 items for the vigilance test. As shown on the right, a vigilance test asks a participant to compare two images, with one of them containing noticeable artifacts.

Therefore, we conduct a user study similar to the one presented in Sec. 4. We asked 47 user study participants on Amazon Mechanical Turk (MTurk) to compare rendered images of meshes deformed using ARAP [54] and ours. Each participant is provided with 20 image pairs and asked to select one image from each pair given the question: "Which edited image is more realistic and plausible? Choose one of the following images." An example of a questionnaire displayed to the participants is shown in Fig. A9 (left). We provide an example of vigilance tests, similar to the user study for 2D mesh editing, in Fig. A9 (right). As summarized in Tab. A4, the deformations produced by our method are preferred over the results from the baseline.

| Methods | Preference (%) ↑ |
| --- | --- |
| ARAP [54] | 41.7 |
| Ours | 58.3 |
Table A4: User study preference for 3D mesh deformation. In a user study targeting users on Amazon Mechanical Turk (MTurk), the results produced using ours were preferred over the outputs from the baseline.

A.7 Ablation Study for 2D Mesh Editing

In this section, we provide qualitative results from the ablation study to validate the impact of each component on the plausibility of editing results. In Tab. A5, we summarize the results obtained by (1) optimizing only $\mathcal{L}_{h}$, (2) optimizing $\mathcal{L}_{h}$ and $\mathcal{L}_{\text{SDS}}$ without LoRA finetuning, (3) skipping the FirstStage, (4) using ARAP initialization, (5) using Poisson initialization, and (6) ours. As mentioned in the main paper, optimizing only $\mathcal{L}_{h}$ (the second column) either distorts texture (the fifth row) or inflates or shrinks other parts of the given shape (the seventh and twelfth rows). This demonstrates the necessity of a visual prior during deformation. We also observe cases where skipping the FirstStage (the fourth column) does not lead to the intended deformation, as our diffusion prior is reluctant to modify shapes from their original states (the first, second, and fifth rows). On the other hand, deformations initialized with the meshes produced by ARAP [54] (the fifth column) or Poisson solve (the sixth column) suffer from distortions that could not be resolved by optimizing $\mathcal{L}_{\text{SDS}}$ in the SecondStage.


Table A5: Ablation study for 2D mesh editing. We examine the impact of each design choice on deformation outputs, including the use of diffusion prior (the second column), LoRA finetuning (the third column), two-stage pipeline (the fourth column), and initialization strategies during the FirstStage (the fifth and sixth columns).
Source | $\mathcal{L}_{h}$ Only | No LoRA [17] | SecondStage Only | ARAP Init. | Poisson Init. | Ours