HeadEvolver: Text to Head Avatars via Expressive
and Attribute-Preserving Mesh Deformation

Duotun Wang¹∗, Hengyu Meng¹,³∗, Zeyu Cai¹, Zhijing Shao¹, Qianxi Liu¹, Lin Wang¹,⁴
Mingming Fan¹,⁴, Xiaohang Zhan², Zeyu Wang¹,⁴
¹The Hong Kong University of Science and Technology (Guangzhou) ²Tencent AI Lab
³South China University of Technology ⁴The Hong Kong University of Science and Technology
∗ indicates equal contributions

Abstract

Current text-to-avatar methods often rely on implicit representations, leading to 3D content that artists cannot easily edit and animate in graphics software. This paper introduces a novel framework for generating stylized head avatars from text guidance, which leverages locally learnable mesh deformation and 2D diffusion priors to achieve high-quality digital assets for attribute-preserving manipulation. Given a template mesh, our method represents mesh deformation with per-face Jacobians and adaptively modulates local deformation using a learnable vector field. This vector field enables anisotropic scaling while preserving the rotation of vertices, which can better express identity and geometric details. We also employ landmark- and contour-based regularization terms to balance the expressiveness and plausibility of generated head avatars from multiple views without relying on any specific shape prior. Our framework can not only generate realistic shapes and textures that can be further edited via text, but also support seamless editing using the preserved attributes from the template mesh, such as 3DMM parameters, blendshapes, and UV coordinates. Extensive experiments demonstrate that our framework can generate diverse and expressive head avatars with high-quality meshes that artists can easily manipulate in 3D graphics software, facilitating downstream applications such as more efficient asset creation and animation with preserved attributes. Our project page is at: https://www.duotun-wang.co.uk/HeadEvolver/

1 Introduction

Head avatar modeling is an important and challenging task in visual computing due to its increasing need in applications such as games [17], movies [20], virtual reality [46], and online education [8]. Traditionally, creating a high-fidelity head avatar demands skilled technical artists to invest considerable time and effort into modeling and animation. While recent advances in AI-generated content (AIGC), particularly in text-to-3D, offer new avenues for accessible and cost-effective avatar creation, some practical gaps persist. For example, several methods [44, 56, 52] used generative models [16, 14, 42] to support text-guided human head generation. Nevertheless, these methods suffer from the difficulty of acquiring high-quality and diverse datasets for training. Another set of methods [15, 22, 54, 31] leveraged frozen large-scale vision language models (VLMs) to create 3D head avatars. These methods use 2D priors to provide multi-view guidance without requiring 3D training data and thus produce more diverse results.

However, there is a significant gap between current text-to-avatar methods and 3D asset creation, as most methods do not effectively integrate high fidelity, user customization, and animation support. This is mainly caused by using implicit shape representations that cannot preserve 3D attributes essential for downstream manipulation, e.g., mesh topology, rigs, and UV mapping. For instance, HeadArtist [31] and HeadSculpt [15] leverage DMTet [40] to create head avatars with 3D Morphable Models (3DMMs) [4, 27] as strong shape priors. While this approach allows for a realistic appearance, the implicit shape representation DMTet used in these works tends to discard statistical information such as shape and expression parameters from 3DMMs and produce noisy meshes.

Head avatar generation using an explicit mesh representation would benefit downstream applications because of its modifiability and compatibility with graphics pipelines, although there is less existing research in this direction. A notable method is TADA [29], which incorporates the optimizations of vertex displacement and SMPL-X parameters [35]. This strategy preserves the semantics of 3DMMs and enables animatable 3D head avatar generation. However, direct vertex displacement tends to deform meshes with self-intersected faces and noisy normals, leading to limited mesh quality and animation artifacts. TextDeformer [10] can change an input mesh smoothly and stably by making deformation via Jacobians, but it cannot generate a textured appearance and lacks local geometry details.

This paper aims to mitigate the trade-offs between creating high-quality avatars and the ease of animation and editing without reconstructing blendshapes or rigs. We propose a novel text-to-avatar framework that produces stylized head avatars through expressive and attribute-preserving mesh deformation. Given a single template mesh such as FLAME [27], ICT-Face [26], or any manifold mesh preferably with facial semantics, our framework deforms it into a desired target shape while preserving feature correspondences such as skinning weights and UV coordinates for further animation and editing (teaser figure, panel (c)).

The key observation is that although the mesh deformation via optimizing Jacobians supports smooth shape changes at a global scale [1], it falls short in fully expressing the semantics by optimizing image losses and performing locally anisotropic deformation. This is essential for generating diverse styles, such as the elongated nose in panel (b) of the teaser figure. Furthermore, it is often laborious to tune deformation hyper-parameters to achieve both text-guided expressiveness and shape smoothness [34]. Therefore, we propose per-triangle learnable weighting factors coupled with Jacobians to enhance fine-grained deformation over the head mesh, which we call vector fields.

In order to build a complete 3D head avatar generation framework via gradient-based optimization, we employ a pre-trained diffusion model [36] with 3D awareness by leveraging a frozen landmark-guided ControlNet [57, 33]. Given a camera pose, we render shaded and normal images and obtain the opacity mask and the predefined vertex-indexed landmarks projected from the 3D head. As Fig. 1 illustrates, the stable diffusion model accepts text prompts, rendered images, and landmarks as additional conditions from ControlNet to compute the Score Distillation Sampling (SDS) loss. In addition, to mitigate the issues of unexpected geometry distortions or self-intersected triangles, we introduce landmark- and contour-based regularization terms to properly constrain the avatar deformation. For example, a well-shaped avatar is likely to be vertically symmetric. To summarize, we make the following contributions:

• We introduce per-triangle vector fields to enhance the expressiveness of mesh deformation over local regions while preserving 3D attributes.
• We propose a diffusion-based framework with avatar regularization terms to support text-to-avatar generation with high-quality mesh.
• Extensive experiments demonstrate that our method can generate stylized 3D head avatars supporting asset creation with interactive editing and animation.

2 Related Work

Table 1: Compared to other text-to-avatar approaches, our method preserves predefined 3D attributes (e.g., vertex-indexed landmarks, rig, blendshapes), is independent of parametric models for joint optimization, produces mesh models of head avatars compatible with further manipulation in 3D software, and requires no training data nor fine-tuning.

Method	3D Attribute Preservation	Independence from Parametric Models	Compatible with Graphics Pipelines	No Training Data nor Fine-tuning
HeadSculpt [15]	✗	✓	✗	✗
Fantasia3D [6]	✗	✓	✓	✓
TADA [29]	✓	✗	✓	✓
DreamFace [56]	✓	✗	✓	✗
Ours	✓	✓	✓	✓

Recently, there has been rapid research progress in digital human modeling and text-to-3D content generation. Our research centers on text-guided synthesis of head avatars employing an explicit mesh representation, aiming to fulfill four criteria outlined in Table 1: 1) preservation of predefined 3D attributes, 2) support for template mesh independent from parametric models, 3) seamless compatibility with graphics pipelines, and 4) gradient-based optimization without requiring training data nor fine-tuning. This section reviews related literature on text-to-3D avatar generation and learning-based mesh optimization.

2.1 Text-to-3D Avatar Generation

The success of text-driven generative models has sparked a wide range of research works in 3D content synthesis. These methods are based on different geometry representations such as NeRF [30, 45], tri-plane [41], mesh [6], and Gaussian Splatting [43, 28]. For text-to-avatar generation tasks, numerous efforts have been made to develop 3D generative models [44, 56, 51, 53] typically based on GAN [13] or diffusion model [16]. These methods often demand extensive 3D head data for training, incurring high costs. Moreover, they face challenges due to limited 3D datasets for creating diverse head avatars matching text prompts.

To bypass the extensive training process of 3D generative models and insufficient 3D data, recent studies [18, 55] have made remarkable advances by harnessing the capabilities of vision-language models (e.g., CLIP [37] or SDS [36]) to generate static human avatars from text prompts. DreamWaltz [21] aims to enhance the realism of avatar animations with pose-guided ControlNet [57]. Similarly, DreamHuman [25] employs a signed distance field conditioned on pose and shape parameters to empower animations with NeRF. However, extracting, editing, and animating an explicit mesh from these works is not straightforward [21] (e.g., it requires marching cubes or marching tetrahedra algorithms) since NeRF and DMTet [40], for instance, represent shapes through network weights. Meanwhile, the issue of poor mesh quality is often concealed by texture, as the geometry and texture optimizations are fully mixed. Recent works [6, 15, 31] disentangle the learning of geometry and texture to improve the avatar mesh quality. Our work also leverages differentiable rendering, stable diffusion, and a hybrid optimization process for mesh and texture synthesis, specifically emphasizing the deformation of an explicit geometry template through semantic guidance instead of generation from scratch or using implicit representations.

2.2 Learning-Based Mesh Optimization

Parametric mesh templates have been a prominent research focus for human avatar reconstruction and generation tasks, as they are well-suited for deep learning approaches. AvatarClip [19] and CLIP-Face [2] employ parametric models such as SMPL-X [35] and FLAME [27] as geometry prior to produce text-aligned avatars. DreamAvatar [5] extracts the shape parameters of SMPL as a 3D prior to learn a NeRF-based color field. HeadSculpt [15] and HeadArtist [31] utilize the FLAME template as canonical shape input and extract its facial landmark to guide avatar generations via DMTet. However, meshes extracted from implicit representations tend to lose semantic information of 3DMMs [4], posing challenges for downstream applications such as interactive editing and animation.

For manipulating geometry through deformation, recent works propose data-driven approaches [3, 1] to predict realistic deformation and some methods focus on learning deformation over one template mesh through tuning per-vertex coordinates [50, 58, 9]. TADA [29] directly optimizes vertex displacements along with the SMPL-X’s pose and shape parameters and achieves animatable avatars. Nonetheless, this characteristic is highly dependent on the quality of 3DMMs. Direct optimization of non-convex mesh structures often converges to undesirable local minima, resulting in noticeable artifacts on surface details. TextDeformer [10] initially integrates CLIP into a learning-based deformation pipeline to address the scarcity of 3D datasets that pair shapes with text semantics. Our approach extends this by optimizing local semantic deformations on any template mesh, independent of its shape, expression, or blendshapes [4]. Our goal is to enhance identity expressiveness guided by text. The resulting head avatars can be animated using the original rigging from the source mesh and seamlessly integrated into existing rendering and editing workflows for design and artistic purposes.

3 Methodology

Figure 1: Framework overview. We deform a template mesh by optimizing per-triangle vector fields guided by a text prompt. Rendered normal and RGB images coupled with MediaPipe landmarks are fed into a diffusion model to compute respective losses. Our regularization of Jacobians controls the fidelity and semantics of facial features that conform to text guidance.

In this section, we describe the details of our generation pipeline, as shown in Fig. 1. Given a source template mesh, we jointly optimize the mesh deformation and an albedo map guided by a text prompt. Our approach is not limited to a specific parametric model (e.g., FLAME) and can be applied to any manifold mesh. Figs. 15 and 16 show the generalization of our approach.

3.1 Mesh Deformation through Jacobian Fields

Let $\mathcal{M}=(\mathcal{V},\mathcal{F})$ denote a source template mesh which consists of a set of vertices $\mathcal{V}\in\mathbb{R}^{n\times 3}$ and faces $\mathcal{F}\in\mathbb{Z}^{m\times 3}$ . For the optimization strategy of mesh deformation, directly applying displacement to each vertex can result in severe self-intersections of triangle faces since it is susceptible to localized noisy gradients from pixel-level losses. Instead, we parameterize the deformation mapping $\Phi:\mathbb{R}^{n\times 3}\to\mathbb{R}^{n\times 3}$ by employing a Jacobian field [1] $J=\{J_{f}|f\in\mathcal{F}\}$ that represents the scaling and rotation of each mesh face. A dual representation of $\mathcal{M}$ can be defined as a stack of per-face Jacobians $J_{f}\in\mathbb{R}^{3\times 3}$ where

$$J_{f}=\nabla_{f}\mathcal{V}, \tag{1}$$

and $\nabla_{f}$ is the gradient operator of triangle $f$ . Conversely, we aim to optimize each vertex’s position from the per-face Jacobian.
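As a rough illustration of Eq. 1, the following NumPy sketch computes per-face Jacobians from the gradients of the piecewise-linear hat functions on each triangle. This is one standard discretization of the per-face gradient operator and is an assumption here, since the text does not spell out the discretization; the function names `hat_gradients` and `face_jacobians` are hypothetical.

```python
import numpy as np

def hat_gradients(V, F):
    """Gradients of the three linear hat functions on every face, shape (m, 3, 3)."""
    v0, v1, v2 = V[F[:, 0]], V[F[:, 1]], V[F[:, 2]]
    normal = np.cross(v1 - v0, v2 - v0)            # norm equals twice the face area
    dbl_area = np.linalg.norm(normal, axis=1, keepdims=True)
    n_hat = normal / dbl_area
    opposite_edges = (v2 - v1, v0 - v2, v1 - v0)   # edge facing corner 0, 1, 2
    # grad(phi_c) = (n x e_c) / (2 * area), stacked as [face, corner, xyz]
    return np.stack([np.cross(n_hat, e) / dbl_area for e in opposite_edges], axis=1)

def face_jacobians(V_src, F, V_def):
    """J_f = sum over corners c of (deformed corner c) outer (hat gradient c)."""
    return np.einsum("fci,fcj->fij", V_def[F], hat_gradients(V_src, F))

# Toy example (assumed data): one triangle in the z = 0 plane.
V = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
F = np.array([[0, 1, 2]])
J_identity = face_jacobians(V, F, V)                             # tangent-plane projector
J_stretch = face_jacobians(V, F, V * np.array([2.0, 1.0, 1.0]))  # stretch along x
```

For the undeformed mesh, each Jacobian is the projector onto the face's tangent plane; stretching the triangle along x changes only the corresponding diagonal entry.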

After obtaining optimized Jacobians in each iteration, we can compute the deformation mapping $\Phi$ over a set of vertices conforming to the source mesh topology by solving a Poisson problem, i.e.,

$$\Phi^{\ast}=\arg\min_{\Phi}\sum_{f\in\mathcal{F}}|f|\,\lVert\nabla_{f}(\Phi)-J_{f}\rVert^{2}_{2}, \tag{2}$$

where $\nabla_{f}(\Phi)$ is the Jacobian of $\Phi$ at a triangle face $f$ and $|f|$ is the area of the face. Thus, the deformation map $\Phi$ over the input template mesh is indirectly optimized through Jacobians $J_{f}$ in the least-squares sense. The solution can be obtained by using a differentiable solver [1, 10] implementing Cholesky decomposition:

$$\Phi^{\ast}=L^{-1}\nabla^{T}\mathcal{A}J, \tag{3}$$

where $\nabla$ is a stack of per-face gradient operators, $\mathcal{A}\in\mathbb{R}^{3m\times 3m}$ is the mass matrix of $\mathcal{M}$ , and $L\in\mathbb{R}^{3n\times 3n}$ is the cotangent Laplacian of $\mathcal{M}$ . This learnable shape representation based on Jacobians is robust to noisy gradients. It can achieve better geometry quality (e.g., fewer self-intersections and noisy normals) than direct vertex displacements, which is further justified in Sec. 4.
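The Poisson step of Eqs. 2 and 3 can be sketched with dense NumPy matrices on a tiny mesh. This is a minimal illustration under stated assumptions, not the differentiable Cholesky-based solver cited above: the gradient operator uses the standard hat-function discretization, the Laplacian is formed as $\nabla^{T}\mathcal{A}\nabla$, and the translation ambiguity of the solution is resolved with a minimum-norm least-squares solve. The names `gradient_matrix` and `poisson_solve` are hypothetical.

```python
import numpy as np

def gradient_matrix(V, F):
    """Dense (3m x n) stack of per-face gradient operators, plus face areas (m,)."""
    m, n = len(F), len(V)
    v = V[F]                                             # [face, corner, xyz]
    normal = np.cross(v[:, 1] - v[:, 0], v[:, 2] - v[:, 0])
    dbl_area = np.linalg.norm(normal, axis=1)
    n_hat = normal / dbl_area[:, None]
    grad = np.zeros((3 * m, n))
    rows = 3 * np.arange(m)
    for c in range(3):
        edge = v[:, (c + 2) % 3] - v[:, (c + 1) % 3]     # edge opposite corner c
        g = np.cross(n_hat, edge) / dbl_area[:, None]    # hat-function gradient
        for j in range(3):
            np.add.at(grad, (rows + j, F[:, c]), g[:, j])
    return grad, 0.5 * dbl_area

def poisson_solve(V, F, J):
    """Vertices whose per-face Jacobians best match J (m, 3, 3) in the sense of Eq. 2."""
    grad, area = gradient_matrix(V, F)
    mass = np.repeat(area, 3)                            # diagonal of the mass matrix
    laplacian = grad.T @ (mass[:, None] * grad)          # L = grad^T A grad
    # Row 3f+j of (grad @ Phi) holds column j of J_f, so stack the transposes.
    target = np.transpose(J, (0, 2, 1)).reshape(-1, 3)
    rhs = grad.T @ (mass[:, None] * target)
    # L is singular (constant shifts); lstsq returns the zero-mean solution.
    phi, *_ = np.linalg.lstsq(laplacian, rhs, rcond=None)
    return phi

# Toy example (assumed data): a unit square split into two triangles.
V = np.array([[0.0, 0, 0], [1, 0, 0], [1, 1, 0], [0, 1, 0]])
F = np.array([[0, 1, 2], [0, 2, 3]])
J = np.tile(np.diag([2.0, 1.0, 1.0]), (2, 1, 1))         # stretch every face along x
V_new = poisson_solve(V, F, J)
```

Because the solve only determines vertices up to a global translation, a practical pipeline would typically pin one vertex or re-center the result; for realistic mesh sizes a sparse factorization would replace the dense solve.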

However, during our experiments, we observed that the Poisson solver applied in Eq. 2 can impose strong geometry constraints for Jacobians to reach a globally coherent mesh deformation [10], which may lose fine-grained details for local shape features and compromise the identity of the generated head mesh (Fig. 5). A straightforward approach is to increase the learning rate for Jacobians to enhance their sensitivities to propagated gradients. This amplifies the rotation and scaling of each face, resulting in stronger deformations (Sec. 4.3). However, intensified rotations raise the likelihood of inverted triangles [39], leading to an unstable optimization process and a poor-quality mesh. To address these concerns, we introduce a learnable per-face vector field $H=\{H_{f}|f\in\mathcal{F}\}$ to enhance the deformation expressiveness of Jacobian field $J$ , i.e.,

$$\begin{split}W_{f}&=\mathrm{diag}(w_{x},w_{y},w_{z}),\\ H_{f}&=W_{f}J_{f}.\end{split} \tag{4}$$

By substituting $H$ into Eq. 3, the deformation mapping can be computed by

$$\Phi^{\ast}=L^{-1}\nabla^{T}\mathcal{A}H. \tag{5}$$

The effect of $W=\{W_{f}|f\in\mathcal{F}\}$ is to control the anisotropic scaling (e.g., stretching and shrinking in different directions) on each face. As shown in Fig. 2, it can adaptively learn the degree of anisotropic scaling on the corresponding triangular faces that can represent the character’s features. Thus, the per-face vector field $H$ can steer the Poisson solver toward the desired deformations and express the identity of head avatars more effectively than applying Jacobians only. The detailed validation is in Sec. 4.3.
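Eq. 4 amounts to a per-face row scaling of each Jacobian. The sketch below shows this in NumPy, assuming the three weights of every face are stored as one row of an `(m, 3)` array; the function names are hypothetical, and the isotropic variant is included only as the special case in which all three weights of a face are equal.

```python
import numpy as np

def apply_vector_field(J, w):
    """H_f = diag(w_f) @ J_f for every face; J is (m, 3, 3), w is (m, 3)."""
    return w[:, :, None] * J          # scales row i of J_f by w_f[i]

def apply_scalar_field(J, s):
    """Isotropic special case: one scale per face, s is (m,)."""
    return s[:, None, None] * J

# Toy example (assumed data): two faces with identity Jacobians.
J = np.tile(np.eye(3), (2, 1, 1))
w = np.array([[1.5, 1.0, 1.0],        # stretch face 0 along x only
              [1.0, 1.0, 0.5]])       # shrink face 1 along z only
H = apply_vector_field(J, w)
```

In an optimization loop, `w` would be a learnable parameter initialized to ones, so that training starts from the unmodified Jacobian field.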

3.2 Text-Guided 3D Synthesis

We aim to build a text-guided head avatar generation framework based on differentiable mesh deformation and texture generation. Motivated by recent success in large-scale VLMs and text-to-3D literature discussed in Sec. 2.1, we utilize a powerful 2D diffusion prior (i.e., SDS [36]) to direct deformations by scoring the rendering realism of the generated shape in our framework.

Specifically, by following the procedure described in Sec. 3.1, we denote $\mathcal{V^{*}}$ as the set of deformed vertices from $J$ . Then, by using a differentiable renderer $\mathbf{R}$ , we render $\mathcal{M^{*}}=(\mathcal{V}^{*},\mathcal{F})$ from a viewpoint delineated by camera extrinsic parameter $C$ , which produces an image $I=\mathbf{R}(\mathcal{M^{*}},C)$ . The SDS loss is leveraged to alter the geometry and color respectively, that is:

$$\nabla_{J}\mathcal{L}_{\mathrm{SDS}}(\epsilon,I)=\mathbb{E}_{t,\epsilon}\left[w(t)\left(\epsilon_{\theta}(z_{t};y,C,t)-\epsilon\right)\frac{\partial I}{\partial J}\right], \tag{6}$$

where $\epsilon\sim\mathcal{N}(0,I)$ , $\epsilon_{\theta}$ is the pre-trained diffusion prior with the network parameter $\theta$ , $w(t)$ is a weight coefficient related to timestep $t\sim\mathcal{U}(0,1)$ , $y$ is the text input condition, $z_{t}$ denotes a latent embedding of $I$ perturbed with noise at time step $t$ , and $C$ is the MediaPipe [33] facial landmark map. Rendered image $I$ can be obtained from the normal $I_{n}$ and shaded $I_{s}$ (Fig. 1).
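The bracketed term of Eq. 6, before the chain-rule factor $\partial I/\partial J$, can be sketched as follows. This is only an illustration under assumptions: `predict_noise` is a hypothetical stand-in for the ControlNet-conditioned diffusion model (with the text $y$ and landmark map $C$ folded in), the forward noising uses the common form $z_{t}=\sqrt{\bar{\alpha}_{t}}z+\sqrt{1-\bar{\alpha}_{t}}\epsilon$, and $w(t)=1-\bar{\alpha}_{t}$ is one common weighting choice rather than a setting confirmed by the text.

```python
import numpy as np

def sds_gradient(latent, eps, predict_noise, alpha_bar_t):
    """Per-element SDS gradient w(t) * (eps_theta(z_t) - eps) with respect to the latent."""
    z_t = np.sqrt(alpha_bar_t) * latent + np.sqrt(1.0 - alpha_bar_t) * eps
    w_t = 1.0 - alpha_bar_t           # assumed weighting, not taken from the paper
    return w_t * (predict_noise(z_t) - eps)

# Toy example (assumed data): a random latent and a dummy noise predictor.
rng = np.random.default_rng(0)
latent = rng.standard_normal((4, 8, 8))
eps = rng.standard_normal(latent.shape)
dummy_predictor = lambda z_t: np.zeros_like(z_t)   # placeholder for the real network
grad = sds_gradient(latent, eps, dummy_predictor, alpha_bar_t=0.6)
```

In a full pipeline this gradient would be pushed back through the image encoder and the differentiable renderer by automatic differentiation to reach $J$ and the albedo map; that part is omitted here.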

To ensure that the deformation mapping from propagated gradients and the Poisson solver can conform to the facial topology and reasonable semantics of the source mesh, we further develop two regularization terms, aiming to preserve 3D attributes of generated head avatars (Figs. 12 and 13).

Landmark regularization. The landmark regularization measures the difference between predefined $N$ 3D face landmarks $\mathbf{k}_{i}\in\mathbb{R}^{3}$ on the canonical template mesh $\mathcal{M}$ and the corresponding ones on the deformed mesh $\mathbf{k}^{\prime}_{i}\in\mathbb{R}^{3}$ after projection. The landmark regularization loss is defined as $\mathcal{L}_{lmk}=\frac{1}{N}\sum_{i=1}^{N}||\mathbf{k}_{i}-\mathbf{k}^{\prime}_{i}||^{2}_{2}$ .
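Since the landmarks are vertex-indexed, $\mathcal{L}_{lmk}$ reduces to a mean squared distance over a fixed set of vertex indices. The sketch below assumes the landmarks are compared directly in 3D and leaves out the projection step mentioned above; `landmark_idx` and the function name are hypothetical.

```python
import numpy as np

def landmark_loss(V_src, V_def, landmark_idx):
    """Mean squared distance between template and deformed landmark vertices."""
    diff = V_src[landmark_idx] - V_def[landmark_idx]
    return float(np.mean(np.sum(diff ** 2, axis=1)))

# Toy example (assumed data): five vertices, three of them tagged as landmarks.
V_src = np.zeros((5, 3))
V_def = V_src + np.array([1.0, 2.0, 2.0])   # every vertex moved by a vector of length 3
loss = landmark_loss(V_src, V_def, [0, 2, 4])
```

Because the indices refer to the shared topology, the same index list works for the template and for every deformed mesh.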

Evolving contour-based regularization. Inspired by AvatarCraft [23], we incorporate an opacity map $O$ to preserve a plausible head shape while minimizing excessive or undesired vertex displacement. To balance diverse geometry shapes and constraints from severe deformations, the source template mesh is periodically updated from deformed mesh every 300 steps [47]. The pixel-level loss is parameterized as $\mathcal{L}_{op}=\frac{1}{HW}\sum_{H,W}||O-O^{\prime}||^{2}_{2}$ , where $H$ and $W$ represent the height and width of the opacity mask of the template mesh $O$ and the optimized head avatar $O^{\prime}$ . The total loss is defined as follows:

$$\mathcal{L}=\lambda_{1}\nabla_{\Phi}\mathcal{L}_{\mathrm{SDS}}+\lambda_{2}\mathcal{L}_{lmk}+\lambda_{3}\mathcal{L}_{op}, \tag{7}$$

where $\lambda_{1}=1$ , $\lambda_{2}=200$ , and $\lambda_{3}=250$ are the hyper-parameters in our experiment settings.
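The opacity term and the weighting of the two regularizers can be sketched as below, using the stated values $\lambda_{2}=200$ and $\lambda_{3}=250$ and the 300-step refresh of the reference mask. The function names are hypothetical, and the SDS term is left out because it enters the optimization as a gradient rather than as a scalar loss value.

```python
import numpy as np

LAMBDA_LMK, LAMBDA_OP = 200.0, 250.0   # lambda_2 and lambda_3 from the text
REFRESH_EVERY = 300                    # steps between reference-mask updates

def opacity_loss(o_ref, o_cur):
    """Pixel-averaged squared difference between two opacity masks."""
    return float(np.mean((o_ref - o_cur) ** 2))

def regularizer(l_lmk, l_op):
    """Weighted sum of the landmark and opacity terms of Eq. 7."""
    return LAMBDA_LMK * l_lmk + LAMBDA_OP * l_op

def should_refresh_reference(step):
    """True on the steps where the reference mask is replaced by the current one."""
    return step > 0 and step % REFRESH_EVERY == 0

# Toy example (assumed data): masks that differ on one quarter of the pixels.
o_ref = np.zeros((4, 4))
o_cur = o_ref.copy()
o_cur[:2, :2] = 1.0
l_op = opacity_loss(o_ref, o_cur)
total_reg = regularizer(0.01, l_op)
```

Refreshing the reference mask lets the silhouette drift gradually toward the text-specified shape while each individual step is still penalized for large contour changes.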

4 Experiments

In this section, we present experiments to evaluate the efficacy of our method both quantitatively and qualitatively for text-to-avatar creation. Then, we demonstrate the results of the ablation study that validate the significance of our key designs for vector field and regularization terms in Sec. 4.3.

Implementation details. For geometry and texture generation, we render normal and shaded images at a resolution of $512\times 512$ pixels and feed them into Realistic Vision 5.1 (RV 5.1) [24]. Compared with Stable Diffusion 2.1 [38], we find that RV 5.1 supports a more appealing avatar appearance. The multi-face Janus and texture misalignment problems are further alleviated by introducing ControlNetMediaPipeFace [57]. The generation process runs on a single NVIDIA RTX 4090 GPU with 20,000 iterations per avatar, which takes around 70 minutes. ResAdapter [7] is further utilized to generate the albedo map at a $1024\times 1024$ resolution. Negative prompts are directly leveraged to enhance the realism and details of the texture, e.g., replacing the empty prompt in classifier-free guidance (CFG) with “unrealistic,” “blurry,” “low quality,” “saturated,” “low contrast,” etc. In addition to the proposed regularization terms, we enforce symmetry along the y-axis in each iteration to constrain the overall position offsets of the generated mesh.

4.1 Qualitative Evaluation

Our evaluation compares two explicit mesh manipulation methods from TADA [29] and TextDeformer [10], and one implicit method from Fantasia3D [6] as baselines. We do not compare with HeadArtist [31] and DreamFace [56] because they are not open-sourced and focus primarily on rendering quality. Instead, our framework emphasizes generating high-quality mesh topologies. We utilized FLAME’s neutral expression as a unified input template mesh to conduct experiments. Visual comparison results are in Fig. 3. TADA and Fantasia3D demonstrate their capabilities of detail-preserving geometry with 3D head priors. However, their results still suffer from geometric artifacts in semantically dense areas, such as noisy vertices around eye regions, although this is less noticeable in a textured rendering. Our method mitigates this issue with the spatial alignments guided by landmarks and opacity regularization terms. Compared to TextDeformer, our approach exhibits similar mesh quality and amplifies facial characteristics following text specifications via proposed vector fields.

In summary, our framework generates expressive human head meshes that inherit facial attributes. The texture generation aligns with the geometry to create 3D assets ready for graphics applications. As depicted in Fig. 8 and Fig. 9, with text guidance, generated models from our method could be effectively manipulated to change appearance. Panel (e) of the teaser figure and Fig. 14 illustrate the potential for seamless utilization of these avatars in 3D software by artists, as they exhibit distinct facial features and fewer artifacts.

Table 2: Quantitative comparison of state-of-the-art methods. The geometry CLIP score is measured on the shading images rendered with uniform albedo colors, the appearance CLIP score is evaluated on the images rendered with textures, and the self-intersection is quantified as a ratio between the number of self-intersections in the deformed mesh and the number of faces. $\uparrow$ indicates higher is better and $\downarrow$ indicates lower is better.

Metric	Ours	TADA	TextDeformer	Fantasia3D
Geometry CLIP $\uparrow$	0.2506	$0.2475$	$0.2462$	$0.1815$
Appearance CLIP $\uparrow$	0.2955	$0.2931$	-	$0.2718$
Self-Intersection $\downarrow$	2.08%	$14.22\%$	$2.19\%$	-

4.2 Quantitative Evaluation

We quantitatively evaluated our framework in generation consistency with text input and generation quality.

Generation Consistency with Text. We first evaluated the relevance of generated results to the corresponding text descriptions by calculating the CLIP score [37]. In addition, to separate the geometric attributes from the influence of texture, we rendered the generated geometry with a uniform albedo. Subsequently, we calculated the geometry CLIP score as a supplementary measure for evaluating the generation consistency of geometry with text prompts. We rendered 20 distinct views and computed the average scores respectively, as shown in Table 2. Our approach achieves the highest scores compared to the baseline methods.
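The view-averaged CLIP score can be illustrated as a mean cosine similarity between one text embedding and the embeddings of the rendered views. The sketch assumes the embeddings have already been produced by a CLIP encoder, which is not shown; the random arrays below are placeholders, not real CLIP outputs, and the function name is hypothetical.

```python
import numpy as np

def clip_score(image_embs, text_emb):
    """Mean cosine similarity between each view embedding and the text embedding."""
    image_embs = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb)
    return float(np.mean(image_embs @ text_emb))

# Toy example (assumed data): 20 views with placeholder 512-dimensional embeddings.
rng = np.random.default_rng(0)
view_embs = rng.standard_normal((20, 512))
prompt_emb = rng.standard_normal(512)
score = clip_score(view_embs, prompt_emb)
```

The geometry and appearance scores would use the same routine on two sets of renderings, one with uniform albedo and one with the generated texture.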

Mesh Quality. We calculated the metric of self-intersection to measure mesh quality, which is the ratio of the number of self-intersections over the number of faces in the deformed mesh. Since the mesh from Fantasia3D is extracted through the marching cubes algorithm [32], it may produce skinny triangles containing small edges and angles with no self-intersections. Therefore, we only compared our method with TADA and TextDeformer. Unlike our method, geometry manipulation through direct vertex displacement tends to converge to a local minimum and disregards the triangulation of the shape.
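One plausible way to compute such a ratio is a brute-force check over face pairs, sketched below. It is an assumption-laden illustration: the text does not specify how intersections are counted, so this version counts intersecting pairs of faces that share no vertex, ignores coplanar and touching cases, and runs in quadratic time, which is only practical for small meshes. The function names are hypothetical.

```python
import itertools
import numpy as np

def segment_hits_triangle(p, q, tri, eps=1e-9):
    """Moller-Trumbore style test: does segment p->q pass through the triangle interior?"""
    d = q - p
    e1, e2 = tri[1] - tri[0], tri[2] - tri[0]
    h = np.cross(d, e2)
    det = e1 @ h
    if abs(det) < eps:                      # segment parallel to the triangle plane
        return False
    s = p - tri[0]
    u = (s @ h) / det
    qv = np.cross(s, e1)
    v = (d @ qv) / det
    t = (e2 @ qv) / det
    return bool(u > eps and v > eps and u + v < 1 - eps and eps < t < 1 - eps)

def self_intersection_ratio(V, F):
    """Number of intersecting, non-adjacent face pairs divided by the number of faces."""
    tris = V[F]
    hits = 0
    for a, b in itertools.combinations(range(len(F)), 2):
        if set(F[a].tolist()) & set(F[b].tolist()):   # skip faces sharing a vertex
            continue
        edges_a = [(tris[a][i], tris[a][(i + 1) % 3]) for i in range(3)]
        edges_b = [(tris[b][i], tris[b][(i + 1) % 3]) for i in range(3)]
        if any(segment_hits_triangle(p, q, tris[b]) for p, q in edges_a) or \
           any(segment_hits_triangle(p, q, tris[a]) for p, q in edges_b):
            hits += 1
    return hits / len(F)

# Toy example (assumed data): a vertical triangle piercing a horizontal one.
V = np.array([[0.0, 0.0, 0.0], [2.0, 0.0, 0.0], [0.0, 2.0, 0.0],
              [0.5, 0.5, -1.0], [0.5, 0.5, 1.0], [1.5, 0.5, 0.5]])
F = np.array([[0, 1, 2], [3, 4, 5]])
ratio = self_intersection_ratio(V, F)
```

For meshes of realistic size, a spatial acceleration structure such as a bounding volume hierarchy would be needed to avoid testing every pair.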

User Study. We conducted a user study to evaluate the robustness and expressiveness of our method with 10 distinct text prompts, 4 of which are from TADA. We used a Google Form to assess 1) geometry quality, 2) texture quality, and 3) consistency with text. We recruited 46 participants, of whom 26 are graduate students majoring in computer graphics and vision, and 20 are company employees specializing in AI content generation. In this form, the participants were instructed to choose the preferred renderings of head avatars from different methods in randomized order, as shown in Table 3. The results show that participants preferred our method by a significant margin compared with the baseline methods in all three metrics.

Table 3: User evaluation of generated head avatars. Compared to other methods, ours is preferred by over 65% of users in terms of geometry quality, texture quality, and consistency with text.

Preference $\uparrow$	Ours	TADA	TextDeformer	Fantasia3D
Geometry Quality	65.9%	$3.9\%$	$29.8\%$	$0.4\%$
Texture Quality	65.2%	$31.9\%$	-	$2.9\%$
Consistency with Text	66.9%	$28.9\%$	$2.3\%$	$1.9\%$

4.3 Ablation Study

Effect of Vector field. We conducted three sets of qualitative comparisons to demonstrate the effectiveness of the per-triangle vector field $H$ . As shown in Fig. 5, using Jacobian $J$ alone tends to overly smooth the gradients, leading to a loss of distinctive character features. Although increasing the learning rate of $J$ highlights character identities, generated meshes are distorted and suffer from severe self-intersections. Furthermore, our experiments introduce the scalar field $S$ which represents the isotropic scaling by setting diagonal elements of each $H_{i}$ to equal values. The results indicate that anisotropic scaling guided through $H$ generates avatars bearing more geometry details and character features than those through $S$ (e.g., the eyebrow of “Octopus Alien” and the ear of “Goblin”). Furthermore, we demonstrate the tuning process of the learning rate (LR) to show that the addition of $H$ effectively enhances the stability and expressiveness of deformations when the LR of $J$ is set in a reasonable range (e.g., between $0.01$ and $0.025$ ). As shown in Fig. 4, we provide the best LR settings, which are $0.015$ for $H$ and $0.01$ for $J$ , respectively. We also demonstrate the generality of the vector field by applying it to an object generation framework (Fig. 16) and observe that the use of $H$ can make optimization converge to desired deformation states faster than using Jacobians only.

Regularization. In deformation optimization, facial features and head poses may deviate unexpectedly. Landmark regularization constrains generated meshes to the input shape’s canonical space, and contour-based regularization contributes to reasonable head shapes. We demonstrate the effectiveness of our regularization designs with another ablation study, as shown in Fig. 6.

5 Applications

Our method can produce head avatars that can be smoothly manipulated (e.g., animated via 3DMMs and text-based editing) and are compatible with graphics tools such as Blender and Maya. Compared with implicit representations, our method enables artists to interactively create, edit, and animate digital head avatar assets for various downstream applications.

Local geometry editing. Besides manipulating generated meshes in 3D software as artists usually do, our approach also supports efficient head editing with text descriptions (teaser figure, panel (b), and Fig. 8). Only $H_{i}$ on certain parts is optimized given the text guidance.

Texture editing through text guidance. Our approach encourages independent texture generation on a fixed geometry, which enables texture editing specified by the text input (Fig. 9). Some unexpected color artifacts may be introduced due to the noises from the stable diffusion model.

Avatar animation. As shown in panel (c) of the teaser figure and in Fig. 10, we demonstrate that the rigging and blendshapes of the head mesh model after deformation are well-preserved for creating head animations and shape manipulations.

Avatar generation from other template meshes. As described in Sec. 3, our approach is not constrained by specific mesh models (e.g., FLAME) as long as they are manifold meshes. Fig. 15 shows an example of generating a “Superman” head from crafted 3D assets even without labeling 3D facial landmarks.

6 Limitations and Future Work

Our framework has a few limitations. Although many parametric head models do not contain eyeball mesh intrinsically (e.g., Facescape [48], NPHM [12], and BFM [11]), our experiments removed the eyeball mesh from FLAME and SMPL-X in preprocessing since Jacobians rely on the manifold mesh structure and continuous mapping. A cage-based representation could offer the possibilities of deformations among multiple mesh instances [49], allowing for separately editable details like hair and accessories. Another minor limitation is that although we jointly optimize the geometry and texture, lighting is not disentangled from diffuse colors, which may add pixel-level noise and oversaturation artifacts on the albedo maps.

As Fig. 3 and Table 2 show, although our results achieve relatively good quantitative and qualitative results, our method cannot always generate corresponding geometry and texture. For example, eyeball appearance is produced only from the texture. Another future direction is to learn 3D head generation from a spherical mesh without strong avatar priors [34]. Designing an encoder to guide deformations with topology awareness may be necessary.

7 Conclusion

In this paper, we have introduced a text-guided generative framework for creating stylized 3D head avatars through learnable deformations. We use Jacobians as an intermediate representation for vertex displacements and improve the expressiveness of deformations by equipping Jacobians with learnable vector fields. As a result, such learnable deformations are embedded into score distillation for geometry and appearance generation. Besides deformation enhancement, the proposed regularization terms help retain 3D attributes from the source mesh to support smooth reusability and editing by artists. Experiments on diverse text prompts demonstrate the effectiveness of our method. Our method can potentially inform future research that better integrates neural generative techniques with practical graphics pipelines.

8 Acknowledgement

This project is sponsored by CCF-Tencent Rhino-Bird Open Research Fund RAGR20230120.

References

[1] Noam Aigerman, Kunal Gupta, Vladimir G. Kim, Siddhartha Chaudhuri, Jun Saito, and Thibault Groueix. Neural jacobian fields: Learning intrinsic mappings of arbitrary meshes. ACM Trans. Graph., 41(4), Jul 2022.
[2] Shivangi Aneja, Justus Thies, Angela Dai, and Matthias Niessner. Clipface: Text-guided editing of textured 3d morphable models. In ACM SIGGRAPH 2023 Conference Proceedings, SIGGRAPH ’23, New York, NY, USA, 2023. Association for Computing Machinery.
[3] Stephen W. Bailey, Dalton Omens, Paul Dilorenzo, and James F. O’Brien. Fast and deep facial deformations. ACM Trans. Graph., 39(4), Aug 2020.
[4] Volker Blanz and Thomas Vetter. A Morphable Model For The Synthesis Of 3D Faces. Association for Computing Machinery, New York, NY, USA, 1 edition, 2023.
[5] Yukang Cao, Yan-Pei Cao, Kai Han, Ying Shan, and Kwan-Yee K Wong. Dreamavatar: Text-and-shape guided 3d human avatar generation via diffusion models. arXiv preprint arXiv:2304.00916, 2023.
[6] Rui Chen, Yongwei Chen, Ningxin Jiao, and Kui Jia. Fantasia3d: Disentangling geometry and appearance for high-quality text-to-3d content creation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 22246–22256, October 2023.
[7] Jiaxiang Cheng, Pan Xie, Xin Xia, Jiashi Li, Jie Wu, Yuxi Ren, Huixia Li, Xuefeng Xiao, Min Zheng, and Lean Fu. Resadapter: Domain consistent resolution adapter for diffusion models. ArXiv, abs/2403.02084, 2024.
[8] Isabel Fitton, Christopher Clarke, Jeremy Dalton, Michael J Proulx, and Christof Lutteroth. Dancing with the avatars: Minimal avatar customisation enhances learning in a psychomotor task. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems, CHI ’23, New York, NY, USA, 2023. Association for Computing Machinery.
[9] Lin Gao, Jie Yang, Yi-Ling Qiao, Yu-Kun Lai, Paul L. Rosin, Weiwei Xu, and Shihong Xia. Automatic unpaired shape deformation transfer. ACM Trans. Graph., 37(6), Dec 2018.
[10] William Gao, Noam Aigerman, Thibault Groueix, Vova Kim, and Rana Hanocka. Textdeformer: Geometry manipulation using text guidance. In ACM SIGGRAPH 2023 Conference Proceedings, SIGGRAPH ’23, New York, NY, USA, 2023. Association for Computing Machinery.
[11] Thomas Gerig, Andreas Morel-Forster, Clemens Blumer, Bernhard Egger, Marcel Luthi, Sandro Schoenborn, and Thomas Vetter. Morphable face models - an open framework. In 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018), pages 75–82, 2018.
[12] Simon Giebenhain, Tobias Kirschstein, Markos Georgopoulos, Martin Rünz, Lourdes Agapito, and Matthias Nießner. Learning neural parametric head models. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2023.
[13] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. Advances in Neural Information Processing Systems, 27, 2014.
[14] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks. Commun. ACM, 63(11):139–144, Oct 2020.
[15] Xiao Han, Yukang Cao, Kai Han, Xiatian Zhu, Jiankang Deng, Yi-Zhe Song, Tao Xiang, and Kwan-Yee K Wong. Headsculpt: Crafting 3d head avatars with text. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
[16] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. volume 33, 2020.
[17] Andrew Hogue, Sunbir Gill, and Michael Jenkin. Automated avatar creation for 3d games. In Proceedings of the 2007 Conference on Future Play, Future Play ’07, page 174–180, New York, NY, USA, 2007. Association for Computing Machinery.
[18] Fangzhou Hong, Mingyuan Zhang, Liang Pan, Zhongang Cai, Lei Yang, and Ziwei Liu. Avatarclip: Zero-shot text-driven generation and animation of 3d avatars. ACM Transactions on Graphics (TOG), 41(4):1–19, 2022.
[19] Fangzhou Hong, Mingyuan Zhang, Liang Pan, Zhongang Cai, Lei Yang, and Ziwei Liu. Avatarclip: Zero-shot text-driven generation and animation of 3d avatars. ACM Transactions on Graphics (TOG), 41(4):1–19, 2022.
[20] Liwen Hu, Shunsuke Saito, Lingyu Wei, Koki Nagano, Jaewoo Seo, Jens Fursund, Iman Sadeghi, Carrie Sun, Yen-Chun Chen, and Hao Li. Avatar digitization from a single image for real-time rendering. ACM Trans. Graph., 36(6), Nov 2017.
[21] Yukun Huang, Jianan Wang, Ailing Zeng, He Cao, Xianbiao Qi, Yukai Shi, Zheng-Jun Zha, and Lei Zhang. Dreamwaltz: Make a scene with complex 3d animatable avatars, 2023.
[22] Sungwon Hwang, Junha Hyung, and Jaegul Choo. Text2control3d: Controllable 3d avatar generation in neural radiance fields using geometry-guided text-to-image diffusion model. arXiv preprint arXiv:2309.03550, 2023.
[23] Ruixiang Jiang, Can Wang, Jingbo Zhang, Menglei Chai, Mingming He, Dongdong Chen, and Jing Liao. Avatarcraft: Transforming text into neural human avatars with parameterized shape and pose control. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 14371–14382, October 2023.
[24] Adhik Joshi and Akash Guptag. Realistic vision 5.1. 2023. Accessed: 2024-05-10.
[25] Nikos Kolotouros, Thiemo Alldieck, Andrei Zanfir, Eduard Bazavan, Mihai Fieraru, and Cristian Sminchisescu. Dreamhuman: Animatable 3d avatars from text. volume 36, 2024.
[26] Ruilong Li, Karl Bladin, Yajie Zhao, Chinmay Chinara, Owen Ingraham, Pengda Xiang, Xinglei Ren, Pratusha Prasad, Bipin Kishore, Jun Xing, and Hao Li. Learning formation of physically-based face attributes, 2020.
[27] Tianye Li, Timo Bolkart, Michael J Black, Hao Li, and Javier Romero. Learning a model of facial shape and expression from 4d scans. ACM Trans. Graph., 36(6):194–1, 2017.
[28] Yixun Liang, Xin Yang, Jiantao Lin, Haodong Li, Xiaogang Xu, and Yingcong Chen. Luciddreamer: Towards high-fidelity text-to-3d generation via interval score matching, 2023.
[29] Tingting Liao, Hongwei Yi, Yuliang Xiu, Jiaxiang Tang, Yangyi Huang, Justus Thies, and Michael J Black. Tada! text to animatable digital avatars. arXiv preprint arXiv:2308.10899, 2023.
[30] Chen-Hsuan Lin, Jun Gao, Luming Tang, Towaki Takikawa, Xiaohui Zeng, Xun Huang, Karsten Kreis, Sanja Fidler, Ming-Yu Liu, and Tsung-Yi Lin. Magic3d: High-resolution text-to-3d content creation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 300–309, 2023.
[31] Hongyu Liu, Xuan Wang, Ziyu Wan, Yujun Shen, Yibing Song, Jing Liao, and Qifeng Chen. Headartist: Text-conditioned 3d head generation with self score distillation. arXiv preprint arXiv:2312.07539, 2023.
[32] William E. Lorensen and Harvey E. Cline. Marching cubes: A high resolution 3d surface construction algorithm. In Proceedings of the 14th Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH ’87, page 163–169, New York, NY, USA, 1987. Association for Computing Machinery.
[33] Camillo Lugaresi, Jiuqiang Tang, Hadon Nash, Chris McClanahan, Esha Uboweja, Michael Hays, Fan Zhang, Chuo-Ling Chang, Ming Guang Yong, Juhyun Lee, Wan-Teh Chang, Wei Hua, Manfred Georg, and Matthias Grundmann. Mediapipe: A framework for building perception pipelines. ArXiv, abs/1906.08172, 2019.
[34] Baptiste Nicolet, Alec Jacobson, and Wenzel Jakob. Large steps in inverse rendering of geometry. ACM Transactions on Graphics (Proceedings of SIGGRAPH Asia), 40(6), Dec 2021.
[35] Georgios Pavlakos, Vasileios Choutas, Nima Ghorbani, Timo Bolkart, Ahmed A. A. Osman, Dimitrios Tzionas, and Michael J. Black. Expressive body capture: 3D hands, face, and body from a single image. In Proceedings IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pages 10975–10985, 2019.
[36] Ben Poole, Ajay Jain, Jonathan T Barron, and Ben Mildenhall. Dreamfusion: Text-to-3d using 2d diffusion. In The Eleventh International Conference on Learning Representations, 2022.
[37] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. CoRR, abs/2103.00020, 2021.
[38] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models, 2021.
[39] Christian Schüller, Ladislav Kavan, Daniele Panozzo, and Olga Sorkine-Hornung. Locally Injective Mappings. Computer Graphics Forum, 2013.
[40] Tianchang Shen, Jun Gao, Kangxue Yin, Ming-Yu Liu, and Sanja Fidler. Deep marching tetrahedra: a hybrid representation for high-resolution 3d shape synthesis. Advances in Neural Information Processing Systems, 34:6087–6101, 2021.
[41] J. Ryan Shue, Eric Ryan Chan, Ryan Po, Zachary Ankner, Jiajun Wu, and Gordon Wetzstein. 3d neural field generation using triplane diffusion. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 20875–20886, 2023.
[42] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In International Conference on Learning Representations, 2020.
[43] Jiaxiang Tang, Jiawei Ren, Hang Zhou, Ziwei Liu, and Gang Zeng. Dreamgaussian: Generative gaussian splatting for efficient 3d content creation, 2023.
[44] Tengfei Wang, Bo Zhang, Ting Zhang, Shuyang Gu, Jianmin Bao, Tadas Baltrusaitis, Jingjing Shen, Dong Chen, Fang Wen, Qifeng Chen, et al. Rodin: A generative model for sculpting 3d digital avatars using diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4563–4573, 2023.
[45] Zhengyi Wang, Cheng Lu, Yikai Wang, Fan Bao, Chongxuan Li, Hang Su, and Jun Zhu. Prolificdreamer: High-fidelity and diverse text-to-3d generation with variational score distillation. In Advances in Neural Information Processing Systems (NeurIPS), 2023.
[46] Xiaoying Wei, Yizheng Gu, Emily Kuang, Xian Wang, Beiyan Cao, Xiaofu Jin, and Mingming Fan. Bridging the generational gap: Exploring how virtual reality supports remote communication between grandparents and grandchildren. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems, CHI ’23, New York, NY, USA, 2023. Association for Computing Machinery.
[47] Yuanyou Xu, Zongxin Yang, and Yi Yang. Seeavatar: Photorealistic text-to-3d avatar generation with constrained geometry and appearance. ArXiv, abs/2312.08889, 2023.
[48] Haotian Yang, Hao Zhu, Yanru Wang, Mingkai Huang, Qiu Shen, Ruigang Yang, and Xun Cao. Facescape: a large-scale high quality 3d face dataset and detailed riggable 3d face prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[49] Wang Yifan, Noam Aigerman, Vladimir G. Kim, Siddhartha Chaudhuri, and Olga Sorkine-Hornung. Neural cages for detail-preserving 3d deformations. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 72–80, 2020.
[50] Kangxue Yin, Jun Gao, Maria Shugrina, Sameh Khamis, and Sanja Fidler. 3dstylenet: Creating 3d shapes with geometric and texture style variations. In Proceedings of International Conference on Computer Vision (ICCV), 2021.
[51] Cuican Yu, Guansong Lu, Yihan Zeng, Jian Sun, Xiaodan Liang, Huibin Li, Zongben Xu, Songcen Xu, Wei Zhang, and Hang Xu. Towards high-fidelity text-guided 3d face generation and manipulation using only images. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15326–15337, 2023.
[52] Yifei Zeng, Yuanxun Lu, Xinya Ji, Yao Yao, Hao Zhu, and Xun Cao. Avatarbooth: High-quality and customizable 3d human avatar generation, 2023.
[53] Chi Zhang, Yiwen Chen, Yijun Fu, Zhenglin Zhou, Gang Yu, Billzb Wang, Bin Fu, Tao Chen, Guosheng Lin, and Chunhua Shen. Styleavatar3d: Leveraging image-text diffusion models for high-fidelity 3d avatar generation. arXiv preprint arXiv:2305.19012, 2023.
[54] Hao Zhang, Yao Feng, Peter Kulits, Yandong Wen, Justus Thies, and Michael J Black. Text-guided generation and editing of compositional 3d avatars. arXiv preprint arXiv:2309.07125, 2023.
[55] Jianfeng Zhang, Xuanmeng Zhang, Huichao Zhang, Jun Hao Liew, Chenxu Zhang, Yi Yang, and Jiashi Feng. Avatarstudio: High-fidelity and animatable 3d avatar creation from text, 2023.
[56] Longwen Zhang, Qiwei Qiu, Hongyang Lin, Qixuan Zhang, Cheng Shi, Wei Yang, Ye Shi, Sibei Yang, Lan Xu, and Jingyi Yu. Dreamface: Progressive generation of animatable 3d faces under text guidance. ACM Trans. Graph., 42(4), Jul 2023.
[57] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3836–3847, 2023.
[58] Mianlun Zheng, Yi Zhou, Duygu Ceylan, and Jernej Barbič. A deep emulator for secondary motion of 3d characters. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5928–5936, 2021.