HeadEvolver: Text to Head Avatars via Expressive
and Attribute-Preserving Mesh Deformation

Duotun Wang1∗, Hengyu Meng1,3∗, Zeyu Cai1, Zhi**g Shao1, Qianxi Liu1, Lin Wang1,4
Mingming Fan1,4, Xiaohang Zhan2, Zeyu Wang1,4
1The Hong Kong University of Science and Technology (Guangzhou) 2Tencent AI Lab
3South China University of Technology 4The Hong Kong University of Science and Technology
∗ indicates equal contributions
Abstract

Current text-to-avatar methods often rely on implicit representations, leading to 3D content that artists cannot easily edit and animate in graphics software. This paper introduces a novel framework for generating stylized head avatars from text guidance, which leverages locally learnable mesh deformation and 2D diffusion priors to achieve high-quality digital assets for attribute-preserving manipulation. Given a template mesh, our method represents mesh deformation with per-face Jacobians and adaptively modulates local deformation using a learnable vector field. This vector field enables anisotropic scaling while preserving the rotation of vertices, which can better express identity and geometric details. We also employ landmark- and contour-based regularization terms to balance the expressiveness and plausibility of generated head avatars from multiple views without relying on any specific shape prior. Our framework can not only generate realistic shapes and textures that can be further edited via text, but also support seamless editing using the preserved attributes from the template mesh, such as 3DMM parameters, blendshapes, and UV coordinates. Extensive experiments demonstrate that our framework can generate diverse and expressive head avatars with high-quality meshes that artists can easily manipulate in 3D graphics software, facilitating downstream applications such as more efficient asset creation and animation with preserved attributes. Our project page is at: https://www.duotun-wang.co.uk/HeadEvolver/

1 Introduction

Head avatar modeling is an important and challenging task in visual computing due to growing demand in applications such as games [17], movies [20], virtual reality [46], and online education [8]. Traditionally, creating a high-fidelity head avatar demands skilled technical artists to invest considerable time and effort into modeling and animation. While recent advances in AI-generated content (AIGC), particularly in text-to-3D, offer new avenues for accessible and cost-effective avatar creation, some practical gaps persist. For example, several methods [44, 56, 52] used generative models [16, 14, 42] to support text-guided human head generation. Nevertheless, these methods are limited by the difficulty of acquiring high-quality and diverse datasets for training. Another set of methods [15, 22, 54, 31] leveraged frozen large-scale vision language models (VLMs) to create 3D head avatars. These methods use 2D priors to provide multi-view guidance without requiring 3D training data and thus produce more diverse results.

However, there is a significant gap between current text-to-avatar methods and 3D asset creation, as most methods do not effectively integrate high fidelity, user customization, and animation support. This is mainly caused by using implicit shape representations that cannot preserve 3D attributes essential for downstream manipulation, e.g., mesh topology, rig, UV mapping. For instance, HeadArtist [31] and HeadSculpt [15] leverage DMTet [40] to create head avatars with 3D Morphable Models (3DMMs) [4, 27] as strong shape priors. While this approach allows for a realistic appearance, the implicit shape representation DMTet used in these works tends to discard statistical information such as shape and expression parameters from 3DMMs and produce noisy meshes.

Head avatar generation using an explicit mesh representation would benefit downstream applications because of its modifiability and compatibility with graphics pipelines, although there is less existing research in this direction. A notable method is TADA [29], which incorporates the optimizations of vertex displacement and SMPL-X parameters [35]. This strategy preserves the semantics of 3DMMs and enables animatable 3D head avatar generation. However, direct vertex displacement tends to deform meshes with self-intersected faces and noisy normals, leading to limited mesh quality and animation artifacts. TextDeformer [10] can change an input mesh smoothly and stably by making deformation via Jacobians, but it cannot generate a textured appearance and lacks local geometry details.

This paper aims to mitigate the trade-offs between creating high-quality avatars and the ease of animation and editing without reconstructing blendshapes or rigs. We propose a novel text-to-avatar framework that produces stylized head avatars through expressive and attribute-preserving mesh deformation. Given a single template mesh such as FLAME [27], ICT-Face [26], or any manifold mesh preferably with facial semantics, our framework deforms it into a desired target shape while preserving feature correspondences such as skinning weights and UV coordinates for further animation and editing (teaser figure, panel (c)).

The key observation is that although mesh deformation via optimizing Jacobians supports smooth shape changes at a global scale [1], it falls short of fully expressing the semantics when optimizing image losses and of performing locally anisotropic deformation. Such local deformation is essential for generating diverse styles, such as the elongated nose in the teaser figure, panel (b). Furthermore, it is often laborious to tune deformation hyper-parameters to achieve both text-guided expressiveness and shape smoothness [34]. Therefore, we propose per-triangle learnable weighting factors coupled with Jacobians to enhance fine-grained deformation over the head mesh, which we call vector fields.

In order to build a complete 3D head avatar generation framework via gradient-based optimization, we employ a pre-trained diffusion model [36] with 3D awareness by leveraging a frozen landmark-guided ControlNet [57, 33]. Given a camera pose, we render shaded and normal images and obtain the opacity mask and the predefined vertex-indexed landmarks projected from the 3D head. As Fig. 1 illustrates, the stable diffusion model accepts text prompts, rendered images, and landmarks as additional conditions from ControlNet to compute the Score Distillation Sampling (SDS) loss. In addition, to mitigate the issues of unexpected geometry distortions or self-intersected triangles, we introduce landmark- and contour-based regularization terms to properly constrain the avatar deformation. For example, a well-shaped avatar is likely to be vertically symmetric. To summarize, we make the following contributions:

  • We introduce per-triangle vector fields to enhance the expressiveness of mesh deformation over local regions while preserving 3D attributes.

  • We propose a diffusion-based framework with avatar regularization terms to support text-to-avatar generation with high-quality mesh.

  • Extensive experiments demonstrate that our method can generate stylized 3D head avatars supporting asset creation with interactive editing and animation.

2 Related Work

Table 1: Compared to other text-to-avatar approaches, our method preserves predefined 3D attributes (e.g., vertex-indexed landmarks, rig, blendshapes), is independent of parametric models for joint optimization, produces mesh models of head avatars compatible with further manipulation in 3D software, and requires neither training data nor fine-tuning.
Method | 3D Attribute Preservation | Independence from Parametric Models | Compatible with Graphics Pipelines | No Training Data nor Fine-tuning
HeadSculpt [15]
Fantasia3D [6]
TADA [29]
DreamFace [56]
Ours

Recently, there has been rapid research progress in digital human modeling and text-to-3D content generation. Our research centers on text-guided synthesis of head avatars employing an explicit mesh representation, aiming to fulfill four criteria outlined in Table 1: 1) preservation of predefined 3D attributes, 2) support for template mesh independent from parametric models, 3) seamless compatibility with graphics pipelines, and 4) gradient-based optimization without requiring training data nor fine-tuning. This section reviews related literature on text-to-3D avatar generation and learning-based mesh optimization.

2.1 Text-to-3D Avatar Generation

The success of text-driven generative models has sparked a wide range of research works in 3D content synthesis. These methods are based on different geometry representations such as NeRF [30, 45], tri-plane [41], mesh [6], and Gaussian Splatting [43, 28]. For text-to-avatar generation tasks, numerous efforts have been made to develop 3D generative models [44, 56, 51, 53], typically based on GANs [13] or diffusion models [16]. These methods often demand extensive 3D head data for training, incurring high costs. Moreover, they face challenges due to limited 3D datasets for creating diverse head avatars matching text prompts.

To bypass the extensive training process of 3D generative models and the shortage of 3D data, recent studies [18, 55] have made remarkable advances by harnessing the capabilities of vision-language models (e.g., CLIP [37] or SDS [36]) to generate static human avatars from text prompts. DreamWaltz [21] aims to enhance the realism of avatar animations with pose-guided ControlNet [57]. Similarly, DreamHuman [25] employs a signed distance field conditioned on pose and shape parameters to empower animations with NeRF. However, extracting, editing, and animating an explicit mesh from these works is not straightforward [21] (e.g., it requires marching cubes or marching tetrahedra algorithms), since NeRF and DMTet [40], for instance, represent shapes through network weights. Meanwhile, the issue of poor mesh quality is often concealed by texture, as the geometry and texture optimizations are fully mixed. Recent works [6, 15, 31] disentangle the learning of geometry and texture to improve the avatar mesh quality. Our work also leverages differentiable rendering, stable diffusion, and a hybrid optimization process for mesh and texture synthesis, specifically emphasizing the deformation of an explicit geometry template through semantic guidance instead of generation from scratch or using implicit representations.

2.2 Learning-Based Mesh Optimization

Parametric mesh templates have been a prominent research focus for human avatar reconstruction and generation tasks, as they are well-suited for deep learning approaches. AvatarCLIP [19] and CLIP-Face [2] employ parametric models such as SMPL-X [35] and FLAME [27] as geometry priors to produce text-aligned avatars. DreamAvatar [5] extracts the shape parameters of SMPL as a 3D prior to learn a NeRF-based color field. HeadSculpt [15] and HeadArtist [31] utilize the FLAME template as canonical shape input and extract its facial landmarks to guide avatar generation via DMTet. However, meshes extracted from implicit representations tend to lose the semantic information of 3DMMs [4], posing challenges for downstream applications such as interactive editing and animation.

For manipulating geometry through deformation, recent works propose data-driven approaches [3, 1] to predict realistic deformation, and some methods focus on learning deformation over one template mesh through tuning per-vertex coordinates [50, 58, 9]. TADA [29] directly optimizes vertex displacements along with SMPL-X's pose and shape parameters and achieves animatable avatars. Nonetheless, this characteristic is highly dependent on the quality of 3DMMs. Direct optimization of non-convex mesh structures often converges to undesirable local minima, resulting in noticeable artifacts on surface details. TextDeformer [10] first integrated CLIP into a learning-based deformation pipeline to address the scarcity of 3D datasets that pair shapes with text semantics. Our approach extends this by optimizing local semantic deformations on any template mesh, independent of its shape, expression, or blendshapes [4]. Our goal is to enhance identity expressiveness guided by text. The resulting head avatars can be animated using the original rigging from the source mesh and seamlessly integrated into existing rendering and editing workflows for design and artistic purposes.

3 Methodology

Figure 1: Framework overview. We deform a template mesh by optimizing per-triangle vector fields guided by a text prompt. Rendered normal and RGB images coupled with MediaPipe landmarks are fed into a diffusion model to compute respective losses. Our regularization of Jacobians controls the fidelity and semantics of facial features that conform to text guidance.

In this section, we describe the details of our generation pipeline, as shown in Fig. 1. Given a source template mesh, we jointly optimize the mesh deformation and an albedo map guided by a text prompt. Our approach is not limited to a specific parametric model (e.g., FLAME) and can be applied to any manifold mesh. Figs. 15 and 16 show the generalization of our approach.

Figure 2: Visualization of vector field $H_{f}$ through $W_{f}$. Orange colors indicate strong deformations and $W_{f}=I$ refers to no deformation enhancement.
Figure 3: Qualitative comparisons. Our method preserves the template mesh’s 3D attributes, yielding diverse and high-quality mesh models.

3.1 Mesh Deformation through Jacobian Fields

Let $\mathcal{M}=(\mathcal{V},\mathcal{F})$ denote a source template mesh, which consists of a set of vertices $\mathcal{V}\in\mathbb{R}^{n\times 3}$ and faces $\mathcal{F}\in\mathbb{Z}^{m\times 3}$. For the optimization strategy of mesh deformation, directly applying displacement to each vertex can result in severe self-intersections of triangle faces, since it is susceptible to localized noisy gradients from pixel-level losses. Instead, we parameterize the deformation mapping $\Phi:\mathbb{R}^{n\times 3}\to\mathbb{R}^{n\times 3}$ by employing a Jacobian field [1] $J=\{J_{f}\mid f\in\mathcal{F}\}$ that represents the scaling and rotation of each mesh face. A dual representation of $\mathcal{M}$ can be defined as a stack of per-face Jacobians $J_{f}\in\mathbb{R}^{3\times 3}$, where

$$J_{f}=\nabla_{f}\mathcal{V}, \tag{1}$$

and $\nabla_{f}$ is the gradient operator of triangle $f$. Conversely, we aim to optimize each vertex's position from the per-face Jacobian.

After obtaining optimized Jacobians in each iteration, we can compute the deformation mapping $\Phi$ over a set of vertices conforming to the source mesh topology by solving a Poisson problem, i.e.,

$$\Phi^{\ast}=\arg\min_{\Phi}\sum_{f\in\mathcal{F}}|f|\,\lVert\nabla_{f}(\Phi)-J_{f}\rVert^{2}_{2}, \tag{2}$$

where $\nabla_{f}(\Phi)$ is the Jacobian of $\Phi$ at a triangle face $f$ and $|f|$ is the area of the face. Thus, the deformation map $\Phi$ over the input template mesh is indirectly optimized through Jacobians $J_{f}$ in the least-squares sense. The solution can be obtained by using a differentiable solver [1, 10] implementing Cholesky decomposition:

$$\Phi^{\ast}=L^{-1}\nabla^{T}\mathcal{A}J, \tag{3}$$

where $\nabla$ is a stack of per-face gradient operators, $\mathcal{A}\in\mathbb{R}^{3m\times 3m}$ is the mass matrix of $\mathcal{M}$, and $L\in\mathbb{R}^{3n\times 3n}$ is the cotangent Laplacian of $\mathcal{M}$. This learnable shape representation based on Jacobians is robust to noisy gradients. It can achieve better geometry quality (e.g., fewer self-intersections and noisy normals) than direct vertex displacements, which is further justified in Sec. 4.
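The relationship between Eqs. 1 to 3 can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names are hypothetical, the system is dense and solved per coordinate (so the Laplacian here is $n\times n$ with three right-hand-side columns), and a minimum-norm least-squares solve stands in for the differentiable Cholesky-based solver. The toy check uses a tetrahedron whose target Jacobians come from an affine deformation, so the solve should recover the deformed vertices up to a translation.

```python
import numpy as np

def face_gradient_operators(V, F):
    """Per-face gradient operators G (m, 3, 3) and face areas (m,).

    G[f] @ u is the gradient of the piecewise-linear function whose values
    at the three corners of face f are u.
    """
    v0, v1, v2 = V[F[:, 0]], V[F[:, 1]], V[F[:, 2]]
    normal = np.cross(v1 - v0, v2 - v0)
    double_area = np.linalg.norm(normal, axis=1)
    normal = normal / double_area[:, None]
    # Edge opposite to each corner; hat-function gradient is (n x edge) / (2A).
    opposite = np.stack([v2 - v1, v0 - v2, v1 - v0], axis=1)  # (m, corner, xyz)
    n_rep = np.repeat(normal[:, None, :], 3, axis=1)
    grads = np.cross(n_rep, opposite) / double_area[:, None, None]
    return np.transpose(grads, (0, 2, 1)), 0.5 * double_area  # (m, xyz, corner)

def jacobians(V_deformed, F, G):
    """Eq. 1: J_f = (G_f U_f)^T, with U_f the deformed corners of face f."""
    return np.transpose(G @ V_deformed[F], (0, 2, 1))

def poisson_solve(V, F, J):
    """Eqs. 2-3: vertices whose per-face Jacobians best match J (area-weighted)."""
    G, area = face_gradient_operators(V, F)
    m, n = len(F), len(V)
    grad = np.zeros((3 * m, n))                 # stacked gradient operator
    for f in range(m):
        grad[3 * f:3 * f + 3, F[f]] = G[f]
    mass = np.repeat(area, 3)[:, None]          # diagonal of the mass matrix
    laplacian = grad.T @ (mass * grad)          # L = grad^T A grad
    targets = np.transpose(J, (0, 2, 1)).reshape(3 * m, 3)
    rhs = grad.T @ (mass * targets)
    # L is singular (translations), so take the minimum-norm solution.
    return np.linalg.lstsq(laplacian, rhs, rcond=None)[0]

# Toy check on a tetrahedron with an affine target deformation.
V = np.array([[0.0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]])
F = np.array([[0, 2, 1], [0, 1, 3], [0, 3, 2], [1, 2, 3]])
M = np.array([[1.5, 0.2, 0.0], [0.0, 1.0, 0.3], [0.0, 0.0, 0.7]])
V_target = V @ M.T + np.array([0.1, -0.2, 0.3])
G, _ = face_gradient_operators(V, F)
V_solved = poisson_solve(V, F, jacobians(V_target, F, G))
```

Because the Laplacian has the constant vectors in its null space, the recovered vertices are only defined up to a translation; a practical solver would pin a vertex or re-center the result.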

However, during our experiments, we observed that the Poisson solver applied in Eq. 2 can impose strong geometry constraints on the Jacobians to reach a globally coherent mesh deformation [10], which may lose fine-grained details of local shape features and compromise the identity of the generated head mesh (Fig. 5). A straightforward approach is to increase the learning rate for Jacobians to enhance their sensitivity to propagated gradients. This amplifies the rotation and scaling of each face, resulting in stronger deformations (Sec. 4.3). However, intensified rotations raise the likelihood of inverted triangles [39], leading to an unstable optimization process and a poor-quality mesh. To address these concerns, we introduce a learnable per-face vector field $H=\{H_{f}\mid f\in\mathcal{F}\}$ to enhance the deformation expressiveness of the Jacobian field $J$, i.e.,

$$\begin{split}W_{f}&=\mathrm{diag}(w_{x},w_{y},w_{z}),\\ H_{f}&=W_{f}J_{f}.\end{split} \tag{4}$$

By substituting $H$ into Eq. 3, the deformation mapping can be computed by

$$\Phi^{\ast}=L^{-1}\nabla^{T}\mathcal{A}H. \tag{5}$$

The effect of $W=\{W_{f}\mid f\in\mathcal{F}\}$ is to control the anisotropic scaling (e.g., stretching and shrinking in different directions) on each face. As shown in Fig. 2, it can adaptively learn the degree of anisotropic scaling on the corresponding triangular faces that can represent the character's features. Thus, the per-face vector field $H$ can prioritize the Poisson solver to approach the desired deformations to express the identity of head avatars more effectively than applying Jacobians only. The detailed validation is in Sec. 4.3.
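As a small illustration of Eq. 4, the sketch below applies per-face weights to a stack of Jacobians with NumPy broadcasting. The function name and the numeric weights are hypothetical; in the actual optimization both the weights and the Jacobians would be learnable tensors in an autodiff framework rather than fixed arrays.

```python
import numpy as np

def apply_vector_field(J, w):
    """Eq. 4 for all faces: H_f = diag(w_f) @ J_f.  J: (m, 3, 3), w: (m, 3)."""
    # Left-multiplying by a diagonal matrix scales the rows of J_f.
    return w[:, :, None] * J

theta = np.pi / 6
rot_z = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                  [np.sin(theta),  np.cos(theta), 0.0],
                  [0.0, 0.0, 1.0]])
J = np.stack([rot_z, rot_z])             # two faces sharing the same Jacobian
w = np.array([[1.0, 1.0, 1.0],           # W_f = I: no deformation enhancement
              [2.0, 1.0, 0.5]])          # stretch x, keep y, shrink z
H = apply_vector_field(J, w)
```

Setting all three weights of a face to the same value gives isotropic scaling, which corresponds to the scalar-field variant compared against in the ablation study.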

3.2 Text-Guided 3D Synthesis

We aim to build a text-guided head avatar generation framework based on differentiable mesh deformation and texture generation. Motivated by recent success in large-scale VLMs and text-to-3D literature discussed in Sec. 2.1, we utilize a powerful 2D diffusion prior (i.e., SDS [36]) to direct deformations by scoring the rendering realism of the generated shape in our framework.

Specifically, by following the procedure described in Sec. 3.1, we denote $\mathcal{V}^{*}$ as the set of deformed vertices from $J$. Then, by using a differentiable renderer $\mathbf{R}$, we render $\mathcal{M}^{*}=(\mathcal{V}^{*},\mathcal{F})$ from a viewpoint delineated by camera extrinsic parameter $C$, which produces an image $I=\mathbf{R}(\mathcal{M}^{*},C)$. The SDS loss is leveraged to alter the geometry and color respectively, that is:

$$\nabla_{J}\mathcal{L}_{\mathrm{SDS}}(\epsilon,I)=\mathbb{E}_{t,\epsilon}\left[w(t)\left(\epsilon_{\theta}(z_{t};y,C,t)-\epsilon\right)\frac{\partial I}{\partial J}\right], \tag{6}$$

where $\epsilon\sim\mathcal{N}(0,I)$, $\epsilon_{\theta}$ is the pre-trained diffusion prior with network parameters $\theta$, $w(t)$ is a weight coefficient related to timestep $t\sim\mathcal{U}(0,1)$, $y$ is the text input condition, $z_{t}$ denotes a latent embedding of $I$ perturbed with noise at timestep $t$, and $C$ is the MediaPipe [33] facial landmark map. The rendered image $I$ can be either the normal image $I_{n}$ or the shaded image $I_{s}$ (Fig. 1).
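The structure of Eq. 6 can be sketched with a toy NumPy example. Everything beyond the formula itself is an assumption made for illustration: the cosine noise schedule, the weighting $w(t)=1-\bar{\alpha}_t$, the omission of the latent encoder and of the text and landmark conditions, and the stand-in noise predictor, which is the ideal predictor for a hypothetical data distribution concentrated at the all-zero image. The function returns only the factor $w(t)(\hat{\epsilon}-\epsilon)$; in a real pipeline an autodiff framework would multiply it by $\partial I/\partial J$ through the differentiable renderer and the Poisson solve.

```python
import numpy as np

def alpha_bar(t):
    """Assumed cosine noise schedule for t in (0, 1)."""
    return np.cos(0.5 * np.pi * t) ** 2

def sds_image_gradient(image, predict_noise, t, rng):
    """w(t) * (eps_hat - eps): the factor that multiplies dI/dJ in Eq. 6."""
    eps = rng.standard_normal(image.shape)
    a = alpha_bar(t)
    z_t = np.sqrt(a) * image + np.sqrt(1.0 - a) * eps   # noised input
    w_t = 1.0 - a                                       # assumed weighting w(t)
    return w_t * (predict_noise(z_t, t) - eps)

def toy_predictor(z_t, t):
    """Ideal noise predictor when every 'real' image is exactly zero."""
    return z_t / np.sqrt(1.0 - alpha_bar(t))

rng = np.random.default_rng(0)
image = np.full((4, 4), 2.0)
grad = sds_image_gradient(image, toy_predictor, t=0.5, rng=rng)
```

With this toy predictor the sampled noise cancels and the gradient is proportional to the image itself, so a gradient-descent step moves the rendering toward the mode of the assumed data distribution, which is the intuition behind score distillation.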

To ensure that the deformation mapping obtained from propagated gradients and the Poisson solver conforms to the facial topology and reasonable semantics of the source mesh, we further develop two regularization terms, aiming to preserve 3D attributes of generated head avatars (Figs. 12 and 13).

Landmark regularization. The landmark regularization measures the difference between $N$ predefined 3D face landmarks $\mathbf{k}_{i}\in\mathbb{R}^{3}$ on the canonical template mesh $\mathcal{M}$ and the corresponding ones on the deformed mesh $\mathbf{k}^{\prime}_{i}\in\mathbb{R}^{3}$ after projection. The landmark regularization loss is defined as $\mathcal{L}_{lmk}=\frac{1}{N}\sum_{i=1}^{N}\lVert\mathbf{k}_{i}-\mathbf{k}^{\prime}_{i}\rVert^{2}_{2}$.

Evolving contour-based regularization. Inspired by AvatarCraft [23], we incorporate an opacity map $O$ to preserve a plausible head shape while minimizing excessive or undesired vertex displacement. To balance diverse geometry shapes and constraints from severe deformations, the source template mesh is periodically updated from the deformed mesh every 300 steps [47]. The pixel-level loss is parameterized as $\mathcal{L}_{op}=\frac{1}{HW}\sum_{H,W}\lVert O-O^{\prime}\rVert^{2}_{2}$, where $H$ and $W$ are the height and width of the opacity masks, $O$ is the mask of the template mesh, and $O^{\prime}$ is the mask of the optimized head avatar. The total loss is defined as follows:

$$\mathcal{L}=\lambda_{1}\nabla_{\Phi}\mathcal{L}_{\mathrm{SDS}}+\lambda_{2}\mathcal{L}_{lmk}+\lambda_{3}\mathcal{L}_{op}, \tag{7}$$

where $\lambda_{1}=1$, $\lambda_{2}=200$, and $\lambda_{3}=250$ are the hyper-parameters in our experiment settings.
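The two regularization terms and their weighting can be written out in a few lines of NumPy. This is a sketch under assumptions: the function names are hypothetical, the inputs are plain arrays rather than differentiable tensors, and the SDS contribution is represented by a placeholder scalar because SDS is defined through its gradient rather than through a loss value.

```python
import numpy as np

def landmark_loss(k_template, k_deformed):
    """Mean squared distance between corresponding landmarks, shape (N, 3)."""
    return float(np.mean(np.sum((k_template - k_deformed) ** 2, axis=1)))

def opacity_loss(O_template, O_deformed):
    """Mean squared difference between two (H, W) opacity masks."""
    return float(np.mean((O_template - O_deformed) ** 2))

def total_loss(l_sds, l_lmk, l_op, lambdas=(1.0, 200.0, 250.0)):
    """Weighted sum with the weights reported in the text."""
    return lambdas[0] * l_sds + lambdas[1] * l_lmk + lambdas[2] * l_op

k_template = np.zeros((2, 3))
k_deformed = np.array([[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]])
O_template = np.zeros((2, 2))
O_deformed = np.array([[1.0, 0.0], [0.0, 0.0]])

l_lmk = landmark_loss(k_template, k_deformed)   # (1 + 4) / 2
l_op = opacity_loss(O_template, O_deformed)     # 1 / 4
loss = total_loss(0.1, l_lmk, l_op)             # 0.1 is a placeholder SDS value
```

The large weights on the two regularizers relative to the SDS term indicate that both quantities are numerically small in practice, so they need amplification to constrain the deformation.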

4 Experiments

In this section, we present experiments to evaluate the efficacy of our method both quantitatively and qualitatively for text-to-avatar creation. Then, we demonstrate the results of the ablation study that validate the significance of our key designs for vector field and regularization terms in Sec. 4.3.

Implementation details. For geometry and texture generation, we render normal and shaded images at a resolution of 512 × 512 pixels and feed them into Realistic Vision 5.1 (RV 5.1) [24]. Compared with Stable Diffusion 2.1 [38], we find that RV 5.1 supports a more appealing avatar appearance. The multi-face Janus and texture misalignment problems are further alleviated by introducing ControlNetMediaPipeFace [57]. The generation process runs on a single NVIDIA RTX 4090 GPU with 20,000 iterations per avatar, which takes around 70 minutes. ResAdapter [7] is further utilized to generate the albedo map with a 1024 × 1024 resolution. Negative prompts are directly leveraged to enhance the realism and details of the texture, e.g., replacing the empty prompt in classifier-free guidance (CFG) with "unrealistic," "blurry," "low quality," "saturated," "low contrast," etc. In addition to the proposed regularization terms, we enforce symmetry along the y-axis in each iteration to constrain the overall position offsets of the generated mesh.
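The use of negative prompts in classifier-free guidance can be sketched as follows. The snippet shows the standard CFG combination with the unconditional (empty-prompt) prediction replaced by the negative-prompt prediction; the function name, the toy arrays, and the guidance scale of 7.5 are illustrative assumptions, since the guidance scale used in the experiments is not stated here.

```python
import numpy as np

def guided_noise(eps_text, eps_negative, scale):
    """CFG with a negative prompt standing in for the empty prompt.

    eps_text:     noise predicted with the positive text prompt
    eps_negative: noise predicted with the negative prompt
    scale:        guidance scale (hypothetical value in the demo below)
    """
    return eps_negative + scale * (eps_text - eps_negative)

eps_text = np.array([1.0, 1.0])
eps_negative = np.array([0.0, 2.0])
eps_guided = guided_noise(eps_text, eps_negative, scale=7.5)
```

A scale of 1 returns the text-conditioned prediction unchanged, while larger values push the result further away from what the negative prompt describes.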

4.1 Qualitative Evaluation

Our evaluation uses two explicit mesh manipulation methods, TADA [29] and TextDeformer [10], and one implicit method, Fantasia3D [6], as baselines. We do not compare with HeadArtist [31] and DreamFace [56] because they are not open-sourced and primarily focus on rendering quality, whereas our framework emphasizes generating high-quality mesh topologies. We utilized FLAME's neutral expression as a unified input template mesh to conduct experiments. Visual comparison results are in Fig. 3. TADA and Fantasia3D demonstrate their capabilities of detail-preserving geometry with 3D head priors. However, their results still suffer from geometric artifacts in semantically dense areas, such as noisy vertices around eye regions, although this is less noticeable in a textured rendering. Our method mitigates this issue with the spatial alignments guided by landmark and opacity regularization terms. Compared to TextDeformer, our approach exhibits similar mesh quality and amplifies facial characteristics following text specifications via the proposed vector fields.

In summary, our framework generates expressive human head meshes that inherit facial attributes. The texture generation aligns with the geometry to create 3D assets ready for graphics applications. As depicted in Fig. 8 and Fig. 9, with text guidance, generated models from our method can be effectively manipulated to change appearance. The teaser figure, panel (e), and Fig. 14 illustrate the potential for seamless utilization of these avatars in 3D software by artists, as they exhibit distinct facial features and fewer artifacts.

Table 2: Quantitative comparison of state-of-the-art methods. The geometry CLIP score is measured on the shading images rendered with uniform albedo colors, the appearance CLIP score is evaluated on the images rendered with textures, and the self-intersection is quantified as a ratio between the number of self-intersections in the deformed mesh and the number of faces. ↑ indicates higher is better and ↓ indicates lower is better.
Metric | Ours | TADA | TextDeformer | Fantasia3D
Geometry CLIP ↑ | 0.2506 | 0.2475 | 0.2462 | 0.1815
Appearance CLIP ↑ | 0.2955 | 0.2931 | - | 0.2718
Self-Intersection ↓ | 2.08% | 14.22% | 2.19% | -

4.2 Quantitative Evaluation

We quantitatively evaluated our framework in generation consistency with text input and generation quality.

Generation Consistency with Text. We first evaluated the relevance of generated results to the corresponding text descriptions by calculating the CLIP score [37]. In addition, to separate the geometric attributes from the influence of texture, we rendered the generated geometry with a uniform albedo. Subsequently, we calculated the geometry CLIP score as a supplementary measure for evaluating the generation consistency of geometry with text prompts. We rendered 20 distinct views and computed the average scores respectively, as shown in Table 2. Our approach achieves the highest scores compared to the baseline methods.
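The per-view averaging described above can be sketched as a mean cosine similarity between image and text embeddings. This is a minimal illustration, assuming the embeddings have already been produced by a CLIP image encoder and text encoder (not shown) and that the score is the raw cosine similarity without further scaling; the function name and the toy two-dimensional embeddings are hypothetical.

```python
import numpy as np

def mean_clip_score(image_embeddings, text_embedding):
    """Average cosine similarity between each view embedding and the text embedding.

    image_embeddings: (num_views, d) array, one row per rendered view
    text_embedding:   (d,) array for the text prompt
    """
    img = image_embeddings / np.linalg.norm(image_embeddings, axis=1, keepdims=True)
    txt = text_embedding / np.linalg.norm(text_embedding)
    return float(np.mean(img @ txt))

# Toy embeddings for 20 views; real ones would come from a CLIP model.
views = np.tile(np.array([3.0, 4.0]), (20, 1))
score_aligned = mean_clip_score(views, np.array([3.0, 4.0]))
score_orthogonal = mean_clip_score(views, np.array([4.0, -3.0]))
```

For the geometry score the same routine would be applied to renderings with a uniform albedo, and for the appearance score to textured renderings.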

Mesh Quality. We measured mesh quality with the self-intersection metric, defined as the ratio of the number of self-intersections to the number of faces in the deformed mesh. Since the mesh from Fantasia3D is extracted through the marching cubes algorithm [32], it may contain skinny triangles with short edges and small angles but no self-intersections. Therefore, we only compared our method with TADA and TextDeformer. Unlike our method, geometry manipulation through direct vertex displacement tends to converge to a local minimum and disregards the triangulation of the shape.
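
A brute-force version of this ratio can be sketched as below. It is an illustration under stated assumptions rather than the evaluation code used in the paper: a "self-intersection" is counted as a pair of triangles that share no vertex and whose edges pierce each other, coplanar overlaps are ignored, and all helper names are hypothetical.

```python
from itertools import combinations
from typing import Sequence, Tuple

Vec = Tuple[float, float, float]
EPS = 1e-9


def sub(a: Vec, b: Vec) -> Vec:
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def dot(a: Vec, b: Vec) -> float:
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def cross(a: Vec, b: Vec) -> Vec:
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def segment_hits_triangle(p0: Vec, p1: Vec, tri: Sequence[Vec]) -> bool:
    """Moller-Trumbore intersection test restricted to the segment p0 -> p1."""
    a, b, c = tri
    d, e1, e2 = sub(p1, p0), sub(b, a), sub(c, a)
    h = cross(d, e2)
    det = dot(e1, h)
    if abs(det) < EPS:  # segment is parallel to the triangle plane
        return False
    s = sub(p0, a)
    u = dot(s, h) / det
    if u < 0.0 or u > 1.0:
        return False
    q = cross(s, e1)
    v = dot(d, q) / det
    if v < 0.0 or u + v > 1.0:
        return False
    t = dot(e2, q) / det  # position of the hit along the segment
    return 0.0 <= t <= 1.0


def edges(tri: Sequence[Vec]):
    return ((tri[0], tri[1]), (tri[1], tri[2]), (tri[2], tri[0]))


def triangles_intersect(t1: Sequence[Vec], t2: Sequence[Vec]) -> bool:
    """Two non-coplanar triangles cross if an edge of one pierces the other."""
    return (any(segment_hits_triangle(p, q, t2) for p, q in edges(t1))
            or any(segment_hits_triangle(p, q, t1) for p, q in edges(t2)))


def self_intersection_ratio(vertices: Sequence[Vec],
                            faces: Sequence[Tuple[int, int, int]]) -> float:
    """Number of intersecting, non-adjacent face pairs divided by the face count."""
    count = 0
    for f, g in combinations(faces, 2):
        if set(f) & set(g):  # faces sharing a vertex are neighbours, not collisions
            continue
        if triangles_intersect([vertices[i] for i in f], [vertices[i] for i in g]):
            count += 1
    return count / len(faces)


# Toy mesh: a flat triangle pierced by an upright one.
vertices = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (0.0, 2.0, 0.0),
            (0.5, 0.5, -1.0), (0.5, 0.5, 1.0), (0.5, 3.0, 0.0)]
faces = [(0, 1, 2), (3, 4, 5)]
ratio = self_intersection_ratio(vertices, faces)
```

The pairwise loop is quadratic in the number of faces, so a practical implementation on full head meshes would likely add a spatial acceleration structure.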

User Study. We conducted a user study to evaluate the robustness and expressiveness of our method with 10 distinct text prompts, 4 of which are from TADA. We used a Google Form to assess 1) geometry quality, 2) texture quality, and 3) consistency with text. We recruited 46 participants, of whom 26 are graduate students majoring in computer graphics and vision, and 20 are company employees specializing in AI content generation. In this form, the participants were instructed to choose their preferred renderings of head avatars from different methods, presented in randomized order. The results in Table 3 show that participants preferred our method by a wide margin over the baseline methods on all three criteria.

Table 3: User evaluation of generated head avatars. Compared to other methods, ours is preferred by over 65% of users in terms of geometry quality, texture quality, and consistency with text.
Preference ↑            Ours     TADA     TextDeformer   Fantasia3D
Geometry Quality        65.9%    3.9%     29.8%          0.4%
Texture Quality         65.2%    31.9%    -              2.9%
Consistency with Text   66.9%    28.9%    2.3%           1.9%

4.3 Ablation Study

Effect of Vector Field. We conducted three sets of qualitative comparisons to demonstrate the effectiveness of the per-triangle vector field $H$. As shown in Fig. 5, using the Jacobian $J$ alone tends to overly smooth the gradients, leading to a loss of distinctive character features. Although increasing the learning rate of $J$ highlights character identities, the generated meshes are distorted and suffer from severe self-intersections. Furthermore, our experiments introduce the scalar field $S$, which represents isotropic scaling by setting the diagonal elements of each $H_i$ to equal values. The results indicate that anisotropic scaling guided by $H$ generates avatars with more geometric detail and character features than isotropic scaling through $S$ (e.g., the eyebrow of “Octopus Alien” and the ear of “Goblin”). We also show the tuning process of the learning rate (LR) to verify that adding $H$ effectively enhances the stability and expressiveness of deformations when the LR of $J$ is set in a reasonable range (e.g., between 0.01 and 0.025). As shown in Fig. 4, the best LR settings are 0.015 for $H$ and 0.01 for $J$. We also demonstrate the generality of the vector field by applying it to an object generation framework (Fig. 16) and observe that using $H$ makes the optimization converge to the desired deformation states faster than using Jacobians only.
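
The difference between the anisotropic field $H$ and the isotropic field $S$ can be illustrated with a small sketch. It assumes that each $H_i$ acts as a diagonal scaling applied to a 3x3 per-face Jacobian by right-multiplication; the exact composition is defined in the method section and may differ, and the function names are hypothetical.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def scale_jacobian(jacobian: Matrix, h: Sequence[float]) -> Matrix:
    """Return J @ diag(h): column c of the 3x3 Jacobian is scaled by h[c]."""
    return [[jacobian[r][c] * h[c] for c in range(3)] for r in range(3)]


def isotropic(s: float) -> List[float]:
    """Scalar field S: all three diagonal entries share one value."""
    return [s, s, s]


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

# Anisotropic H_i: each axis has its own scale, so a face can stretch
# along one direction while shrinking along another.
aniso = scale_jacobian(identity, [2.0, 1.0, 0.5])

# Isotropic S_i: one shared scale, so a face can only grow or shrink uniformly.
iso = scale_jacobian(identity, isotropic(1.5))
```

Under this assumption, $S$ spans a one-parameter subset of the scalings that $H$ can express per face, which is one way to read the richer detail observed with $H$ in Fig. 5.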

Regularization. In deformation optimization, facial features and head poses may deviate unexpectedly. Landmark regularization constrains generated meshes to the input shape’s canonical space, and contour-based regularization contributes to reasonable head shapes. We demonstrate the effectiveness of our regularization designs with another ablation study, as shown in Fig. 6.
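
As a rough illustration of how a landmark term can keep facial features anchored, the sketch below penalizes the mean squared distance between landmark vertices of the deformed mesh and the template. This is only one simple form such a term could take; the formulation used in the paper may differ (for instance, by comparing projected 2D landmarks), and all names and values here are hypothetical.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def landmark_loss(deformed: Sequence[Point],
                  template: Sequence[Point],
                  landmark_ids: Sequence[int]) -> float:
    """Mean squared distance between corresponding landmark vertices."""
    total = 0.0
    for i in landmark_ids:
        total += sum((d - t) ** 2 for d, t in zip(deformed[i], template[i]))
    return total / len(landmark_ids)


template = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
deformed = [(0.0, 0.0, 0.0), (1.0, 0.0, 2.0), (5.0, 1.0, 0.0)]

# Only vertices 0 and 1 are treated as landmarks; vertex 2 may move freely.
loss = landmark_loss(deformed, template, landmark_ids=[0, 1])
```

Because the deformation preserves mesh topology, landmark indices defined on the template remain valid on the deformed mesh, which is what makes an index-based term like this possible.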

Figure 4: Learning rate (LR) tuning. Supported by our experiments, adding per-triangle $H$ through $W$ can enhance the identity and character features of the avatar from the text prompt. We also observe the best LR of $W$ could be 0.015, and larger LRs can cause unexpected severe deformations.
Figure 5: Ablation study on vector fields. The per-triangle $H$ can enhance the identity and character features of the avatar.
Figure 6: Ablation study results on regularization. Introducing an opacity map and landmark regularization contributes to maintaining symmetric facial features and aligning them with the pose of the input mesh.

5 Applications

Our method can produce head avatars that can be smoothly manipulated (e.g., animated via 3DMMs and text-based editing) and are compatible with graphics tools such as Blender and Maya. Compared with implicit representations, our method enables artists to interactively create, edit, and animate digital head avatar assets for various downstream applications.

Local geometry editing. Besides manipulating generated meshes in 3D software as artists usually do, our approach also supports efficient head editing with text descriptions (teaser figure (b) and Fig. 8). Only $H_i$ on certain parts is optimized given the text guidance.
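
Restricting optimization to selected parts can be sketched as a masked update, where only the $H_i$ of faces inside the edited region receive a step. The snippet assumes plain gradient descent and a boolean per-face mask derived from a segmentation; the optimizer, the mask source, and the names are illustrative assumptions, while the step size 0.015 is the LR reported for $H$ in the ablation study.

```python
from typing import List, Sequence


def masked_step(h: Sequence[Sequence[float]],
                grads: Sequence[Sequence[float]],
                editable: Sequence[bool],
                lr: float) -> List[List[float]]:
    """Gradient-descent step applied only to faces flagged as editable."""
    return [
        [hv - lr * gv for hv, gv in zip(h_i, g_i)] if on else list(h_i)
        for h_i, g_i, on in zip(h, grads, editable)
    ]


h = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]          # per-face diagonal of H_i
grads = [[10.0, 0.0, -10.0], [10.0, 10.0, 10.0]]  # toy gradients
editable = [True, False]                          # e.g., face 0 lies in the edited region
updated = masked_step(h, grads, editable, lr=0.015)
```

Faces outside the mask keep their values, so the rest of the head stays unchanged while the selected region follows the new text prompt.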

Texture editing through text guidance. Our approach supports independent texture generation on a fixed geometry, which enables texture editing specified by the text input (Fig. 9). Some unexpected color artifacts may be introduced due to noise from the Stable Diffusion model.

Avatar animation. As shown in the teaser figure (c) and Fig. 10, the rigging and blendshapes of the head mesh are well preserved after deformation and can be used to create head animations and shape manipulations.

Avatar generation from other template meshes. As described in Sec. 3, our approach is not constrained to specific mesh models (e.g., FLAME) as long as the inputs are manifold meshes. Fig. 15 shows an example of generating a “Superman” head from crafted 3D assets even without labeling 3D facial landmarks.
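
Before using an arbitrary template, one might run a quick sanity check on the manifold requirement. The sketch below tests only edge-manifoldness (no edge shared by more than two triangles), which is a necessary but not sufficient condition; the function name and the toy meshes are hypothetical.

```python
from collections import Counter
from typing import Sequence, Tuple


def is_edge_manifold(faces: Sequence[Tuple[int, int, int]]) -> bool:
    """True if no undirected edge is shared by more than two triangles."""
    edge_count = Counter()
    for a, b, c in faces:
        for u, v in ((a, b), (b, c), (c, a)):
            edge_count[(min(u, v), max(u, v))] += 1
    return all(n <= 2 for n in edge_count.values())


# A closed tetrahedron: every edge is shared by exactly two faces.
tetrahedron = [(0, 1, 2), (0, 3, 1), (1, 3, 2), (2, 3, 0)]

# Attaching a third triangle to edge (0, 1) creates a non-manifold "fin".
fin = tetrahedron + [(0, 1, 4)]

tetra_ok = is_edge_manifold(tetrahedron)
fin_ok = is_edge_manifold(fin)
```

A template that fails this check would need cleanup, such as removing or separating the offending components, before per-face Jacobians can be applied.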

6 Limitations and Future Work

Our framework has a few limitations. Because Jacobians rely on a manifold mesh structure and a continuous mapping, our experiments removed the eyeball mesh from FLAME and SMPL-X in preprocessing, although many parametric head models do not contain an eyeball mesh in the first place (e.g., Facescape [48], NPHM [12], and BFM [11]). A cage-based representation could enable deformations across multiple mesh instances [49], allowing for separately editable details like hair and accessories. Another minor limitation is that although we jointly optimize the geometry and texture, lighting is not disentangled from diffuse colors, which may add pixel-level noise and oversaturation artifacts to the albedo maps.

As Fig. 3 and Table 2 show, although our method achieves relatively good quantitative and qualitative results, it cannot always generate geometry that corresponds to the texture. For example, eyeball appearance is produced only by the texture. Another future direction is to learn 3D head generation from a spherical mesh without strong avatar priors [34]. Designing an encoder to guide deformations with topology awareness may be necessary.

7 Conclusion

In this paper, we have introduced a text-guided generative framework for creating stylized 3D head avatars through learnable deformations. We use Jacobians as an intermediate representation for vertex displacements and improve the expressiveness of deformations by equipping Jacobians with learnable vector fields. These learnable deformations are embedded into score distillation for geometry and appearance generation. Besides deformation enhancement, the proposed regularization terms help retain 3D attributes from the source mesh to support smooth reusability and editing by artists. Experiments on diverse text prompts demonstrate the effectiveness of our method. Our method can potentially inform future research that better integrates neural generative techniques with practical graphics pipelines.

8 Acknowledgement

This project is sponsored by CCF-Tencent Rhino-Bird Open Research Fund RAGR20230120.

References

  • [1] Noam Aigerman, Kunal Gupta, Vladimir G. Kim, Siddhartha Chaudhuri, Jun Saito, and Thibault Groueix. Neural jacobian fields: Learning intrinsic mappings of arbitrary meshes. ACM Trans. Graph., 41(4), Jul 2022.
  • [2] Shivangi Aneja, Justus Thies, Angela Dai, and Matthias Niessner. Clipface: Text-guided editing of textured 3d morphable models. In ACM SIGGRAPH 2023 Conference Proceedings, SIGGRAPH ’23, New York, NY, USA, 2023. Association for Computing Machinery.
  • [3] Stephen W. Bailey, Dalton Omens, Paul Dilorenzo, and James F. O’Brien. Fast and deep facial deformations. ACM Trans. Graph., 39(4), Aug 2020.
  • [4] Volker Blanz and Thomas Vetter. A Morphable Model For The Synthesis Of 3D Faces. Association for Computing Machinery, New York, NY, USA, 1 edition, 2023.
  • [5] Yukang Cao, Yan-Pei Cao, Kai Han, Ying Shan, and Kwan-Yee K Wong. Dreamavatar: Text-and-shape guided 3d human avatar generation via diffusion models. arXiv preprint arXiv:2304.00916, 2023.
  • [6] Rui Chen, Yongwei Chen, Ningxin Jiao, and Kui Jia. Fantasia3d: Disentangling geometry and appearance for high-quality text-to-3d content creation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 22246–22256, October 2023.
  • [7] Jiaxiang Cheng, Pan Xie, Xin Xia, Jiashi Li, Jie Wu, Yuxi Ren, Huixia Li, Xuefeng Xiao, Min Zheng, and Lean Fu. Resadapter: Domain consistent resolution adapter for diffusion models. ArXiv, abs/2403.02084, 2024.
  • [8] Isabel Fitton, Christopher Clarke, Jeremy Dalton, Michael J Proulx, and Christof Lutteroth. Dancing with the avatars: Minimal avatar customisation enhances learning in a psychomotor task. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems, CHI ’23, New York, NY, USA, 2023. Association for Computing Machinery.
  • [9] Lin Gao, Jie Yang, Yi-Ling Qiao, Yu-Kun Lai, Paul L. Rosin, Weiwei Xu, and Shihong Xia. Automatic unpaired shape deformation transfer. ACM Trans. Graph., 37(6), Dec 2018.
  • [10] William Gao, Noam Aigerman, Thibault Groueix, Vova Kim, and Rana Hanocka. Textdeformer: Geometry manipulation using text guidance. In ACM SIGGRAPH 2023 Conference Proceedings, SIGGRAPH ’23, New York, NY, USA, 2023. Association for Computing Machinery.
  • [11] Thomas Gerig, Andreas Morel-Forster, Clemens Blumer, Bernhard Egger, Marcel Luthi, Sandro Schoenborn, and Thomas Vetter. Morphable face models - an open framework. In 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018), pages 75–82, 2018.
  • [12] Simon Giebenhain, Tobias Kirschstein, Markos Georgopoulos, Martin Rünz, Lourdes Agapito, and Matthias Nießner. Learning neural parametric head models. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2023.
  • [13] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. Advances in Neural Information Processing Systems, 27, 2014.
  • [14] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks. Commun. ACM, 63(11):139–144, oct 2020.
  • [15] Xiao Han, Yukang Cao, Kai Han, Xiatian Zhu, Jiankang Deng, Yi-Zhe Song, Tao Xiang, and Kwan-Yee K Wong. Headsculpt: Crafting 3d head avatars with text. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
  • [16] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. volume 33, 2020.
  • [17] Andrew Hogue, Sunbir Gill, and Michael Jenkin. Automated avatar creation for 3d games. In Proceedings of the 2007 Conference on Future Play, Future Play ’07, page 174–180, New York, NY, USA, 2007. Association for Computing Machinery.
  • [18] Fangzhou Hong, Mingyuan Zhang, Liang Pan, Zhongang Cai, Lei Yang, and Ziwei Liu. Avatarclip: Zero-shot text-driven generation and animation of 3d avatars. ACM Transactions on Graphics (TOG), 41(4):1–19, 2022.
  • [19] Fangzhou Hong, Mingyuan Zhang, Liang Pan, Zhongang Cai, Lei Yang, and Ziwei Liu. Avatarclip: Zero-shot text-driven generation and animation of 3d avatars. ACM Transactions on Graphics (TOG), 41(4):1–19, 2022.
  • [20] Liwen Hu, Shunsuke Saito, Lingyu Wei, Koki Nagano, Jaewoo Seo, Jens Fursund, Iman Sadeghi, Carrie Sun, Yen-Chun Chen, and Hao Li. Avatar digitization from a single image for real-time rendering. ACM Trans. Graph., 36(6), Nov 2017.
  • [21] Yukun Huang, Jianan Wang, Ailing Zeng, He Cao, Xianbiao Qi, Yukai Shi, Zheng-Jun Zha, and Lei Zhang. Dreamwaltz: Make a scene with complex 3d animatable avatars, 2023.
  • [22] Sungwon Hwang, Junha Hyung, and Jaegul Choo. Text2control3d: Controllable 3d avatar generation in neural radiance fields using geometry-guided text-to-image diffusion model. arXiv preprint arXiv:2309.03550, 2023.
  • [23] Ruixiang Jiang, Can Wang, Jingbo Zhang, Menglei Chai, Mingming He, Dongdong Chen, and Jing Liao. Avatarcraft: Transforming text into neural human avatars with parameterized shape and pose control. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 14371–14382, October 2023.
  • [24] Adhik Joshi and Akash Guptag. Realistic vision 5.1. 2023. Accessed: 2024-05-10.
  • [25] Nikos Kolotouros, Thiemo Alldieck, Andrei Zanfir, Eduard Bazavan, Mihai Fieraru, and Cristian Sminchisescu. Dreamhuman: Animatable 3d avatars from text. volume 36, 2024.
  • [26] Ruilong Li, Karl Bladin, Yajie Zhao, Chinmay Chinara, Owen Ingraham, Pengda Xiang, Xinglei Ren, Pratusha Prasad, Bipin Kishore, Jun Xing, and Hao Li. Learning formation of physically-based face attributes, 2020.
  • [27] Tianye Li, Timo Bolkart, Michael J Black, Hao Li, and Javier Romero. Learning a model of facial shape and expression from 4d scans. ACM Trans. Graph., 36(6):194–1, 2017.
  • [28] Yixun Liang, Xin Yang, Jiantao Lin, Haodong Li, Xiaogang Xu, and Yingcong Chen. Luciddreamer: Towards high-fidelity text-to-3d generation via interval score matching, 2023.
  • [29] Tingting Liao, Hongwei Yi, Yuliang Xiu, Jiaxaing Tang, Yangyi Huang, Justus Thies, and Michael J Black. Tada! text to animatable digital avatars. arXiv preprint arXiv:2308.10899, 2023.
  • [30] Chen-Hsuan Lin, Jun Gao, Luming Tang, Towaki Takikawa, Xiaohui Zeng, Xun Huang, Karsten Kreis, Sanja Fidler, Ming-Yu Liu, and Tsung-Yi Lin. Magic3d: High-resolution text-to-3d content creation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 300–309, 2023.
  • [31] Hongyu Liu, Xuan Wang, Ziyu Wan, Yujun Shen, Yibing Song, Jing Liao, and Qifeng Chen. Headartist: Text-conditioned 3d head generation with self score distillation. arXiv preprint arXiv:2312.07539, 2023.
  • [32] William E. Lorensen and Harvey E. Cline. Marching cubes: A high resolution 3d surface construction algorithm. In Proceedings of the 14th Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH ’87, page 163–169, New York, NY, USA, 1987. Association for Computing Machinery.
  • [33] Camillo Lugaresi, Jiuqiang Tang, Hadon Nash, Chris McClanahan, Esha Uboweja, Michael Hays, Fan Zhang, Chuo-Ling Chang, Ming Guang Yong, Juhyun Lee, Wan-Teh Chang, Wei Hua, Manfred Georg, and Matthias Grundmann. Mediapipe: A framework for building perception pipelines. ArXiv, abs/1906.08172, 2019.
  • [34] Baptiste Nicolet, Alec Jacobson, and Wenzel Jakob. Large steps in inverse rendering of geometry. ACM Transactions on Graphics (Proceedings of SIGGRAPH Asia), 40(6), Dec 2021.
  • [35] Georgios Pavlakos, Vasileios Choutas, Nima Ghorbani, Timo Bolkart, Ahmed A. A. Osman, Dimitrios Tzionas, and Michael J. Black. Expressive body capture: 3D hands, face, and body from a single image. In Proceedings IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pages 10975–10985, 2019.
  • [36] Ben Poole, Ajay Jain, Jonathan T Barron, and Ben Mildenhall. Dreamfusion: Text-to-3d using 2d diffusion. In The Eleventh International Conference on Learning Representations, 2022.
  • [37] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. CoRR, abs/2103.00020, 2021.
  • [38] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models, 2021.
  • [39] Christian Schüller, Ladislav Kavan, Daniele Panozzo, and Olga Sorkine-Hornung. Locally Injective Mappings. Computer Graphics Forum, 2013.
  • [40] Tianchang Shen, Jun Gao, Kangxue Yin, Ming-Yu Liu, and Sanja Fidler. Deep marching tetrahedra: a hybrid representation for high-resolution 3d shape synthesis. Advances in Neural Information Processing Systems, 34:6087–6101, 2021.
  • [41] J. Ryan Shue, Eric Ryan Chan, Ryan Po, Zachary Ankner, Jiajun Wu, and Gordon Wetzstein. 3d neural field generation using triplane diffusion. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 20875–20886, 2023.
  • [42] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In International Conference on Learning Representations, 2020.
  • [43] Jiaxiang Tang, Jiawei Ren, Hang Zhou, Ziwei Liu, and Gang Zeng. Dreamgaussian: Generative gaussian splatting for efficient 3d content creation, 2023.
  • [44] Tengfei Wang, Bo Zhang, Ting Zhang, Shuyang Gu, Jianmin Bao, Tadas Baltrusaitis, Jingjing Shen, Dong Chen, Fang Wen, Qifeng Chen, et al. Rodin: A generative model for sculpting 3d digital avatars using diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4563–4573, 2023.
  • [45] Zhengyi Wang, Cheng Lu, Yikai Wang, Fan Bao, Chongxuan Li, Hang Su, and Jun Zhu. Prolificdreamer: High-fidelity and diverse text-to-3d generation with variational score distillation. In Advances in Neural Information Processing Systems (NeurIPS), 2023.
  • [46] Xiaoying Wei, Yizheng Gu, Emily Kuang, Xian Wang, Beiyan Cao, Xiaofu Jin, and Mingming Fan. Bridging the generational gap: Exploring how virtual reality supports remote communication between grandparents and grandchildren. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems, CHI ’23, New York, NY, USA, 2023. Association for Computing Machinery.
  • [47] Yuanyou Xu, Zongxin Yang, and Yi Yang. Seeavatar: Photorealistic text-to-3d avatar generation with constrained geometry and appearance. ArXiv, abs/2312.08889, 2023.
  • [48] Haotian Yang, Hao Zhu, Yanru Wang, Mingkai Huang, Qiu Shen, Ruigang Yang, and Xun Cao. Facescape: a large-scale high quality 3d face dataset and detailed riggable 3d face prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
  • [49] Wang Yifan, Noam Aigerman, Vladimir G. Kim, Siddhartha Chaudhuri, and Olga Sorkine-Hornung. Neural cages for detail-preserving 3d deformations. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 72–80, 2020.
  • [50] Kangxue Yin, Jun Gao, Maria Shugrina, Sameh Khamis, and Sanja Fidler. 3dstylenet: Creating 3d shapes with geometric and texture style variations. In Proceedings of International Conference on Computer Vision (ICCV), 2021.
  • [51] Cuican Yu, Guansong Lu, Yihan Zeng, Jian Sun, Xiaodan Liang, Huibin Li, Zongben Xu, Songcen Xu, Wei Zhang, and Hang Xu. Towards high-fidelity text-guided 3d face generation and manipulation using only images. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15326–15337, 2023.
  • [52] Yifei Zeng, Yuanxun Lu, Xinya Ji, Yao Yao, Hao Zhu, and Xun Cao. Avatarbooth: High-quality and customizable 3d human avatar generation, 2023.
  • [53] Chi Zhang, Yiwen Chen, Yijun Fu, Zhenglin Zhou, Gang Yu, Billzb Wang, Bin Fu, Tao Chen, Guosheng Lin, and Chunhua Shen. Styleavatar3d: Leveraging image-text diffusion models for high-fidelity 3d avatar generation. arXiv preprint arXiv:2305.19012, 2023.
  • [54] Hao Zhang, Yao Feng, Peter Kulits, Yandong Wen, Justus Thies, and Michael J Black. Text-guided generation and editing of compositional 3d avatars. arXiv preprint arXiv:2309.07125, 2023.
  • [55] Jianfeng Zhang, Xuanmeng Zhang, Huichao Zhang, Jun Hao Liew, Chenxu Zhang, Yi Yang, and Jiashi Feng. Avatarstudio: High-fidelity and animatable 3d avatar creation from text, 2023.
  • [56] Longwen Zhang, Qiwei Qiu, Hongyang Lin, Qixuan Zhang, Cheng Shi, Wei Yang, Ye Shi, Sibei Yang, Lan Xu, and Jingyi Yu. Dreamface: Progressive generation of animatable 3d faces under text guidance. ACM Trans. Graph., 42(4), Jul 2023.
  • [57] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3836–3847, 2023.
  • [58] Mianlun Zheng, Yi Zhou, Duygu Ceylan, and Jernej Barbič. A deep emulator for secondary motion of 3d characters. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5928–5936, 2021.
Figure 7: More results for qualitative comparison. For each method, we show the meshes with diffuse texture, without texture, and zoomed in. Compared to other methods, we can generate head avatars with high-quality geometry and appearance.
Figure 8: Examples of local geometry editing. Our framework can use per-vertex segmentation information shown in Fig. 13, enabling text-based mesh editing that changes avatars’ local characteristics.
Figure 9: Text-based texture editing. With additional text prompts, our method supports texture editing in specific partial regions.
Figure 10: Example of attribute inheritance from FLAME [27]. If the input template mesh is from 3DMMs [4], users could manipulate the deformed mesh freely via morphable attributes such as expression and shape parameters.
Figure 11: More representative generation results.
Figure 12: Examples of texture transfer. Since the UV coordinates of the input template mesh are well preserved, various generated texture maps can be directly applied to the generated mesh.
Figure 13: Preserved semantic consistency (e.g., 3D landmarks and dense segmentation). Since our method retains mesh topology, facial features from the template mesh can be well preserved under regularization (Eq. 7) for further animation and editing.
Figure 14: Generated head avatars that can be directly manipulated (e.g., extra accessories) in graphics software without mesh post-processing.
Figure 15: Avatar generation with various mesh inputs. The meshes from SMPL-X [35] and an arbitrary head model are deformed into text-guided full-body and head avatars respectively.
Figure 16: Generalization experiments. We implement the vector field in the framework of TextDeformer [10] and find that our method can accelerate mesh optimizations to approach text guidance in early epochs.