License: arXiv.org perpetual non-exclusive license
arXiv:2401.03476v1 [cs.MM] 07 Jan 2024

FREETALKER: CONTROLLABLE SPEECH AND TEXT-DRIVEN GESTURE GENERATION BASED ON DIFFUSION MODELS FOR ENHANCED SPEAKER NATURALNESS

Abstract

Current talking avatars mostly generate co-speech gestures based on the audio and text of the utterance, without considering the non-speaking motion of the speaker. Furthermore, previous works on co-speech gesture generation have designed network structures around individual gesture datasets, which results in limited data volume, compromised generalizability, and restricted speaker movements. To tackle these issues, we introduce FreeTalker, which, to the best of our knowledge, is the first framework for the generation of both spontaneous (e.g., co-speech gesture) and non-spontaneous (e.g., moving around the podium) speaker motions. Specifically, we train a diffusion-based model for speaker motion generation that employs unified representations of both speech-driven gestures and text-driven motions, utilizing heterogeneous data sourced from various motion datasets. During inference, we use classifier-free guidance to control the style of the generated clips. Additionally, to create smooth transitions between clips, we use DoubleTake, a method that leverages a generative prior and ensures seamless motion blending. Extensive experiments show that our method generates natural and controllable speaker movements. Our code, model, and demo are available at https://youngseng.github.io/FreeTalker/.

Index Terms—  Motion processing, gesture generation, multimodal learning, human-computer interaction

1 Introduction

In various applications like virtual agents, animation, and human-computer interaction, the motions of a speaker are of paramount importance [1, 2, 3, 4]. These motions can be primarily divided into two segments: co-speech gestures that are inherently tied to the spoken content and non-spontaneous motions exhibited during talks [1, 5].

In recent years, substantial focus has been dedicated to the generation of co-speech gestures. ZeroEGGS [6] emphasizes naturalness and zero-shot style control. [7] adapts DiffWave for audio-driven motion synthesis, highlighting distinctive styles and control. DiffuseStyleGesture [8] and GestureDiffuCLIP [9] generate stylized gestures with exceptional human likeness and appropriateness. However, existing works primarily focus on global style control of co-speech gestures and do not facilitate free movement of the speaker, such as walking around the stage, pointing or looking in specific directions, or interacting with the audience. These aspects are crucial in presentations and speeches. In the domain of non-spontaneous motions [10], some works such as MDM [11], M2DM [10], and MotionDiffuse [12], have focused on text-controlled motion generation, achieving improvements in realism and controllability. PriorMDM [13] introduces composition methods for denoising diffusion models.

Despite these notable advancements, a significant gap remains. To our knowledge, no prior effort coherently integrates both of these motion categories. Challenges arise from varied motion representations and the intricacies of multimodal learning. MoFusion [14] addresses dataset harmonization through pretraining for multi-task learning. Similarly, [15] offers a framework for motion retargeting. UDE [16] introduces an engine for human motion sequences from diverse inputs. UnifiedGesture [17] further improves speech-driven gesture generation across multiple datasets. It is important to recognize the inherent challenges in utilizing multiple datasets.

In this paper, we propose a novel framework for generating both spontaneous and non-spontaneous speaker motions. Specifically, we first develop a diffusion-based model [18] for speaker motion generation, utilizing heterogeneous data from various motion datasets. Then, we employ classifier-free guidance [19] during inference for highly controllable style in the generated clips. Additionally, we adopt DoubleTake [13] to create smooth transitions between clips and ensure seamless motion blending. The main contributions of our work are: (1) Proposing FreeTalker, the first framework to the best of our knowledge for generating both spontaneous and non-spontaneous speaker motions trained on multiple datasets. (2) Incorporating classifier-free guidance and DoubleTake in our diffusion-based model for enhanced flexibility and control in gesture generation. (3) Demonstrating improved naturalness in generated speaker motions through extensive experiments, surpassing existing approaches in terms of motion quality.

2 Proposed Approach

We aim to generate free-motion speakers using heterogeneous data from diverse motion datasets. In this section, we first describe the preprocessing steps required to integrate various motion data. Building on this, we introduce the diffusion model for motion generation. We then illustrate the controlled text-guided gesture generation method and explore long motion generation. Together, these components form a comprehensive system for effective generation of natural gestures.

2.1 Motion Processing

We aim to preserve the features of the different motion datasets when unifying them. Prior work takes a different route: [16] represents human motions with discrete codes, and [17] retargets human motion to a homograph consisting of five terminal joints (head, hands, and feet), potentially losing important detail such as the shoulders and fingers. Our approach addresses this issue and preserves motion details. We first convert the rotation matrices of the motion capture (BVH format) data to the axis-angle representation of SMPL-X [20]. For the 3D position dataset, we fit it to the SMPL-X representation using VPoser [20]. We then scale the 3D translations of the root joint appropriately and adjust the initial orientation to be uniform across the different datasets, as in [17]. With the forward computation of the SMPL-X model, we obtain the 3D positions of the SMPL-X representation. Then, as in [21], we use root height, root linear and rotational velocity, joint rotation, joint position, joint velocity, and foot contact as kinematic feature representations. Each frame of the processed motion sequence has 659-dimensional features, which we denote as $\hat{x}_0 \in \mathbb{R}^{T_M \times 659}$, where $T_M$ denotes the number of frames in the motion sequence.
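The first conversion step above (rotation matrix to axis-angle) can be sketched as follows. This is a minimal NumPy illustration of the standard conversion, not the authors' preprocessing code; the function name `rotmat_to_axis_angle` and the numerical thresholds are assumptions made for this example.

```python
import numpy as np

def rotmat_to_axis_angle(R: np.ndarray) -> np.ndarray:
    """Convert a 3x3 rotation matrix to an axis-angle vector (unit axis * angle)."""
    # The trace gives the rotation angle: tr(R) = 1 + 2 cos(theta).
    cos_theta = np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0)
    theta = np.arccos(cos_theta)
    if theta < 1e-8:
        return np.zeros(3)  # (near-)identity rotation
    if np.pi - theta < 1e-6:
        # Near 180 degrees the skew-symmetric part vanishes;
        # (R + I) / 2 equals axis * axis^T, so read the axis from a column.
        B = (R + np.eye(3)) / 2.0
        k = int(np.argmax(np.diag(B)))
        return theta * B[:, k] / np.sqrt(B[k, k])
    # General case: the skew-symmetric part of R is 2 sin(theta) * [axis]_x.
    w = np.array([R[2, 1] - R[1, 2], R[0, 2] - R[2, 0], R[1, 0] - R[0, 1]])
    return theta * w / (2.0 * np.sin(theta))

# A 90-degree rotation about the z-axis.
Rz = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
axis_angle = rotmat_to_axis_angle(Rz)
```

In practice this would be applied per joint and per frame before assembling the kinematic features.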

2.2 Diffusion Model for Motion Generation

Fig. 1: (Top) Denoising module. A noising step $t$ and a noisy motion sequence $x_t$ at this noising step conditioning on $c$ (including text description and audio) are fed into the model. PE indicates the addition of a positional encoding. (Bottom) Sample module. We predict $\hat{x}_0$ with the denoising process, then add the noise to the noising step $x_{t-1}$ with the diffuse process. This process is repeated from $t = T$ until $t = 0$.

As illustrated in Figure 1, we develop a diffusion model [18] inspired by [11] and [8]. For a noising step $t \in T$, we assume that $x_T \sim \mathcal{N}(0, I)$. The model assumes a stochastic process with $T$ noising steps: $q(x_t \mid x_{t-1}) = \mathcal{N}(\sqrt{\alpha_t}\, x_{t-1}, (1-\alpha_t) I)$. The denoising process aims to predict the clean motion $\hat{x}_0$ from a noised motion $x_t$, a noise step $t$, a text condition encoded to CLIP [22] space (represented by $d$, $d \in \mathbb{R}^{512}$), and an audio condition. The audio representation, consistent with [23], includes MFCC, Mel Spectrum, Pitch, Energy, WavLM [24], and Onsets. We perform linear interpolation of audio features in the time dimension to match the number of gesture frames, denoted as $a$, where $a \in \mathbb{R}^{T_M \times 1133}$.
Subsequently, the textual description is spliced together as the first frame along with the speech embedding, the noise step, and the noisy action, feeding it into the self-attention [25] layer, to yield the generated motion sequence. The denoising process is expressed as $\hat{x}_0 = \operatorname{Denoising}(x_t, t, c)$, where $c = [d, a]$. In practice, due to the lack of datasets with both non-spontaneous speaker motion and co-speech gestures, we blend datasets with speech-driven gestures and text-driven motions, and the missing modalities are set to zero during training. The model is trained using Huber loss [26] function:

$$\mathcal{L} = E_{x_0 \sim q(x_0 \mid c),\, t \sim [1, T]}\left[\|x_0 - \hat{x}_0\|_2^2\right] \tag{1}$$
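To make the training objective concrete, the sketch below draws a noised sample and evaluates the squared error written in Equation (1). It is an illustration under stated assumptions, not the authors' code. It relies on the standard closed form obtained by composing the per-step Gaussians, $q(x_t \mid x_0) = \mathcal{N}(\sqrt{\bar{\alpha}_t}\, x_0, (1-\bar{\alpha}_t) I)$ with $\bar{\alpha}_t = \prod_{s \le t} \alpha_s$. The cosine schedule form with offset `s = 0.008` is one commonly used variant and is assumed here, and the names `cosine_alpha_bar`, `q_sample`, and `eq1_loss` are hypothetical.

```python
import numpy as np

def cosine_alpha_bar(T: int, s: float = 0.008) -> np.ndarray:
    """Cumulative products alpha_bar_t for t = 0..T under a cosine schedule."""
    steps = np.arange(T + 1) / T
    f = np.cos((steps + s) / (1.0 + s) * np.pi / 2.0) ** 2
    return f / f[0]  # normalised so that alpha_bar_0 = 1

def q_sample(x0, t, alpha_bar, rng):
    """Draw x_t ~ N(sqrt(alpha_bar_t) * x0, (1 - alpha_bar_t) * I)."""
    noise = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise

def eq1_loss(x0, x0_hat):
    """Squared L2 distance between the clean and predicted motion, as in Eq. (1)."""
    return float(np.sum((x0 - x0_hat) ** 2))

rng = np.random.default_rng(0)
alpha_bar = cosine_alpha_bar(1000)     # T = 1000 noising steps
x0 = rng.standard_normal((180, 659))   # stand-in for one normalised motion clip
x_t = q_sample(x0, 500, alpha_bar, rng)
```

A real training step would pass `x_t`, `t`, and the condition through the denoising network and compare its output with `x0`.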

During inference, at each noising step $t$, the original sample $\hat{x}_0$ is predicted and noised back to $x_{t-1}$. This process is iteratively repeated, starting from $t = T$ and continuing until $t = 0$ is reached, resulting in more natural motion generation.
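The inference loop described above can be sketched as follows. This is a minimal illustration, assuming the re-noising step simply draws from the forward process at step $t-1$ around the predicted clean motion; the `denoise` callable stands in for the trained network, and the linear `alpha_bar` array is a toy schedule used only to keep the example short.

```python
import numpy as np

def sample(denoise, cond, shape, alpha_bar, rng):
    """Iteratively predict x0_hat and noise it back to step t - 1."""
    T = len(alpha_bar) - 1
    x_t = rng.standard_normal(shape)        # x_T ~ N(0, I)
    for t in range(T, 0, -1):
        x0_hat = denoise(x_t, t, cond)      # predict the clean motion
        if t == 1:
            return x0_hat                   # step 0 carries no noise
        noise = rng.standard_normal(shape)
        x_t = (np.sqrt(alpha_bar[t - 1]) * x0_hat
               + np.sqrt(1.0 - alpha_bar[t - 1]) * noise)
    return x_t

# Toy setup: 50 steps and a placeholder "network" that always predicts zeros.
alpha_bar = np.linspace(1.0, 0.0, 51)
motion = sample(lambda x_t, t, c: np.zeros_like(x_t), None, (180, 659),
                alpha_bar, np.random.default_rng(0))
```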

2.3 Controllable text-guided gesture generation

Generating gestures that are both expressive and consistent with textual descriptions is a challenge. Our diffusion model addresses this problem by extending the core idea of the classifier-free approach [19, 7, 8] to adjust the strength of the non-spontaneous motion. As illustrated in Figure 1, a random mask is added to the textual embedding for classifier-free learning. The classifier-free guidance of gesture generation is achieved by combining the predictions of the text-conditioned model $\operatorname{Denoise}(x_t, t, c_1)$, where $c_1 = [d, a]$, and the audio-conditioned model $\operatorname{Denoise}(x_t, t, c_2)$, where $c_2 = [\varnothing, a]$, as follows:

$$\hat{x}_{0,\gamma,c_1,c_2} = \gamma \operatorname{Denoise}(x_t, t, c_1) + (1-\gamma) \operatorname{Denoise}(x_t, t, c_2) \tag{2}$$

where $\hat{x}_{0,\gamma,c_1,c_2}$ represents the combined output, and $\gamma$ is a parameter controlling the balance between the text-conditioned and audio-conditioned models. In this work, the Denoising module learns both text-conditioned and audio-conditioned distributions by randomly masking 10% of the samples using Bernoulli masks.
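A small sketch may help make Equation (2) and the training-time masking concrete. It assumes that the empty text condition $\varnothing$ is represented by an all-zero embedding, which matches the statement that missing modalities are set to zero but is not confirmed for the mask itself. The function names and the toy denoiser are hypothetical.

```python
import numpy as np

def guided_prediction(denoise, x_t, t, d, a, gamma):
    """Eq. (2): blend the text+audio and audio-only predictions with weight gamma."""
    x0_text_audio = denoise(x_t, t, (d, a))                  # c1 = [d, a]
    x0_audio_only = denoise(x_t, t, (np.zeros_like(d), a))   # c2 = [empty, a]
    return gamma * x0_text_audio + (1.0 - gamma) * x0_audio_only

def mask_text(d_batch, rng, p_drop=0.1):
    """Zero the text embedding of roughly p_drop of the samples (Bernoulli mask)."""
    keep = rng.random(d_batch.shape[0]) >= p_drop
    return d_batch * keep[:, None]

# Toy denoiser whose output depends only on the text embedding it receives.
toy_denoise = lambda x_t, t, c: np.full_like(x_t, c[0].sum())
x_t = np.zeros((180, 659))
d, a = np.ones(512), np.zeros((180, 1133))
half = guided_prediction(toy_denoise, x_t, 10, d, a, gamma=0.5)
```

With `gamma=0` only the audio-conditioned prediction is used, and with `gamma=1` only the text-and-audio prediction is used, which is the range explored in the style-editing experiment.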

2.4 Long Motion Generation

In tasks involving time series, a major challenge is generating long and coherent motion sequences. Traditional approaches leveraging seed poses [8, 27] in generative tasks with non-time-aware sequences (e.g., text-to-motion) do not work well, so we use DoubleTake [13] to generate long motion sequences. Specifically, we first generate each sample conditioned on its own text, audio, and a handshake $\tau$ with its neighboring intervals through the denoising process, formulated as $\tau_i = (1 - \vec{\alpha}) \odot M_{i-1}[-h:] + \vec{\alpha} \odot M_i[:h]$, where $h$ is the length of $\tau$, $M_i$ indicates the $i^{th}$ sequence, $\alpha_j = j/h, \forall j : j \in [0:h)$, and $\odot$ indicates an element-wise multiplication. Then we refine the transitions by reshaping the batch and focusing on the ‘transition sandwich’ ($M_i$, $\tau_i$, $M_{i+1}$).
We apply a soft-masking feature, using soft mask $M_{soft}$ and hard mask $M_{hard}$ for the sequence $M$ and handshake $\tau$. The masks ensure a gradual transition between the mask values, allowing $b$-frame-long linear masking between $M_{hard}$ and $M_{soft}$. This process refines the originally generated motion (suffix or prefix) to fit the transition during the second take at every denoising step. We partially noise and denoise the sandwich for $T'$ noising steps: $M'' = M' + M_{hard} \odot M_{soft} \odot (M'_{noisy} - M')$. Here, $M''$ is the refined transition of the sequence, and $M'$ is the original.
Finally, we construct the long motion by unfolding the refined sequences and transitions, resulting in a smooth motion.
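The handshake blending formula for $\tau_i$ can be written in a few lines of NumPy. This sketch covers only the linear cross-fade between the last $h$ frames of one sequence and the first $h$ frames of the next; the second-take refinement with soft and hard masks is not shown, and the function name `handshake` is an assumption.

```python
import numpy as np

def handshake(prev_seq, next_seq, h):
    """tau_i = (1 - alpha) * M_{i-1}[-h:] + alpha * M_i[:h], with alpha_j = j / h."""
    alpha = (np.arange(h) / h)[:, None]   # shape (h, 1), broadcast over features
    return (1.0 - alpha) * prev_seq[-h:] + alpha * next_seq[:h]

# Two toy clips of 180 frames with 659 features each, handshake length h = 20.
M_prev = np.zeros((180, 659))
M_next = np.ones((180, 659))
tau = handshake(M_prev, M_next, h=20)
```

Because $\alpha_j$ runs from 0 to $(h-1)/h$, the first handshake frame equals the previous sequence and later frames move linearly towards the next one.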

3 Experiments

3.1 Experiment setup

In our experiments, we use the text-driven (non-spontaneous) motion generation dataset HumanML3D [21] and the speech-driven (spontaneous) gesture generation dataset BEAT [28]. All motion data are first resampled to 20 FPS. For the HumanML3D dataset, we only use data with motion frame counts between 40 and 180 frames, and the maximum text length for CLIP encoding is set to 20. During training, the motion sequence length is set to $T_M = 180$ frames, with zero-padding for shorter sequences. For BEAT, we use the gestures of four English speakers as described in [28] and randomly select a 180-frame segment of speech and corresponding gestures from a continuous gesture sequence. To balance the number of motion data samples from both datasets, we employ weighted sampling to construct the dataloader. All motion data are normalized by subtracting the mean and dividing by the standard deviation. The data is split into training, validation, and testing sets in an 8:1:1 ratio. For the diffusion model, we use $T = 1000$ noising steps and a cosine noise schedule. The self-attention layer has a hidden space dimension of 256. The batch size is 256, the learning rate is 2e-4, and the total number of training steps is set to 1 million. The model is trained on a V100 GPU for five days.
The DoubleTake method uses a handshake size $h = 20$, a blend length $b = 10$, a maximum $M_{hard}$ value of 85%, a minimum $M_{soft}$ value of 15%, and $T' = 900$ denoising steps for $M'_{noisy}$.
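The weighted sampling mentioned in the setup can be illustrated as below. The paper does not specify how the weights are chosen, so this sketch assumes the simplest option, in which every dataset receives the same total sampling probability regardless of its size. The function name and the toy dataset sizes are hypothetical.

```python
import numpy as np

def balanced_weights(dataset_ids):
    """Per-sample probabilities that give each dataset equal total mass."""
    ids, inverse, counts = np.unique(dataset_ids, return_inverse=True,
                                     return_counts=True)
    return 1.0 / (len(ids) * counts[inverse])

# Toy example: 9 clips from dataset 0 and 1 clip from dataset 1.
dataset_ids = np.array([0] * 9 + [1])
weights = balanced_weights(dataset_ids)

# Draw one batch of indices according to the balanced probabilities.
rng = np.random.default_rng(0)
batch_indices = rng.choice(len(dataset_ids), size=256, p=weights)
```

Under this choice, a batch contains on average as many clips from the small dataset as from the large one.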

3.2 Experimental results and analysis

3.2.1 Visualization

Table 1: Quantitative results of comparison with the baseline models and ablation studies. ‘→’ denotes that closer to the real motion is better. ‘Naturalness’ denotes the “Ours vs. Compared model” result of the user study. ‘⋆’ denotes “Ours vs. Ground Truth”, implying a more rigorous evaluation, while entries without this mark are in reference to comparisons with other models. ‘*’ denotes the proposed model. ‘-’ and ‘+’ denote the removal and addition of the component, respectively.

| Name | Co-speech gesture: jerk → | Co-speech gesture: acceleration → | Co-speech gesture: FID ↓ | Co-speech gesture: Naturalness ↑ | Motion generation: SSIM ↑ | Motion generation: FID ↓ | Motion generation: Naturalness ↑ | Free-motion: FID ↓ | Free-motion: Naturalness ↑ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Natural Mocap | 135.36 ± 58.61 | 12.39 ± 11.79 | - | - | - | - | - | - | - |
| DiffuseStyleGesture [8] | 206.52 ± 83.65 | 5.68 ± 2.19 | 0.008 | 49% | - | - | - | - | - |
| MDM [11] | - | - | - | - | 0.386 | 0.050 | 53% | - | - |
| Ours* | 245.78 ± 108.27 | 6.03 ± 2.55 | 0.139 | 40%⋆ | 0.457 | 0.226 | 24%⋆ | 0.139 | - |
| - Huber loss* | 226.30 ± 73.53 | 5.98 ± 2.33 | 0.027 | 52% | 0.389 | 0.041 | 53% | 0.029 | 54% |
| + local attention* | 203.77 ± 84.45 | 5.97 ± 2.51 | 0.005 | 49% | 0.431 | 0.051 | 54% | 0.032 | 52% |
Fig. 2: Visualization of FreeTalker generation. We can control the speaker’s non-spontaneous motion through text, while the speaker generates spontaneous co-speech gestures from speech. The light yellow color indicates the model’s ability to smoothly transition between motion segments.

As illustrated in Figure 2, FreeTalker generates a sequence of motions, including the speaker walking on stage, waving to the audience, delivering a speech, and finally bowing before leaving the stage. The generated motions exhibit smooth transitions between segments, allowing the speaker to move and speak in a natural manner.

Fig. 3: Visualization of style editing (non-spontaneous motion control) based on co-speech gestures. From top to bottom, generated motions gradually transition from text description-based control to spontaneous co-speech gestures based on speech, resulting in highly controllable gestures.

As shown in Figure 3, when $\gamma$ in Equation (2) is set to 0, the gesture generation is conditioned only on speech input, enabling the model to produce co-speech gestures. As $\gamma$ gradually increases from 0 to 1, the model generates non-spontaneous gestures while maintaining alignment with speech. This allows us to freely edit the generated gestures and motions according to the text description.

3.2.2 Objective Evaluation

Due to the lack of methods capable of generating both spontaneous co-speech gestures and non-spontaneous motions, we evaluate each type of motion separately. We select [8] and [11] as our baseline models, as they have recently achieved excellent results. For co-speech gesture generation, we assess jerk, acceleration [29], and FID [30]; on the other hand, for textual description-driven motion generation, we evaluate SSIM [31] and FID. The results are shown in Table 1. Our method attains competitive results with the baseline models for both generation tasks, demonstrating the effectiveness of our approach. Moreover, our method slightly outperforms the baselines in terms of jerk, acceleration and SSIM metrics.
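For readers unfamiliar with the jerk and acceleration metrics, the sketch below computes them as the average magnitude of the third and second time derivatives of joint positions, estimated with finite differences. The exact definition in [29] (choice of joints, units, averaging) may differ, so this should be read as an assumption-laden illustration; the function name is hypothetical.

```python
import numpy as np

def mean_derivative_magnitude(positions, fps, order):
    """Average norm of the order-th finite-difference derivative.

    positions: array of shape (frames, joints, 3) holding 3D joint positions.
    """
    deriv = np.diff(positions, n=order, axis=0) * (fps ** order)
    return float(np.linalg.norm(deriv, axis=-1).mean())

# Toy motion: one joint moving along x with constant acceleration 2 units/s^2.
fps = 20
t = np.arange(50) / fps
positions = np.zeros((50, 1, 3))
positions[:, 0, 0] = t ** 2

acceleration = mean_derivative_magnitude(positions, fps, order=2)
jerk = mean_derivative_magnitude(positions, fps, order=3)
```

For this constant-acceleration toy motion the acceleration estimate is 2 and the jerk estimate is numerically zero.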

3.2.3 Subjective Evaluation

To further evaluate the quality of the generated motions, we conducted a user study focusing on naturalness (the quality of the generated motions). The study consisted of ten pairs of naturalness scoring, evaluating the naturalness of motions generated solely by co-speech gestures, solely by text-driven motions, and a combination of both. During the evaluation, participants were presented with motion sequences generated by our model and the compared models. Following [11], users were prompted with the question: “Which motion appears more human-like and reasonable?” Twenty-five people participated in the study. The results are shown in Table 1. A score closer to 100% denotes higher naturalness. It can be observed that our model demonstrates commendable performance, often rivaling the baseline models in terms of perceived naturalness. This suggests that expanding the motion database could further improve the performance.

Our method significantly enhances the Speech2Gesture and Text2Motion subtasks, as shown in Table 1. It improves motion accuracy and naturalness, offering a diverse range of gestures, both spontaneous and non-spontaneous. This approach fills gaps in current methodologies and introduces a more adaptable motion generation framework.

3.2.4 Ablation study

To investigate the effectiveness of different components of our method, we designed the following ablation experiments. The results are detailed in the bottom two rows of Table 1. When the model is trained without Huber loss and instead uses MSE loss, the overall performance experiences a slight decline. Huber loss is more robust to outliers, generalizes better, and is better suited for smoothing the gradient to obtain a more coherent and natural sequence of actions. Furthermore, it converges to better results with fewer iterations. When we feed $a$ into the local attention network [32] with relative position encoding [33] to extract the local information related to the gesture before the self-attention layer, as in [8], the performance of co-speech gesture generation decreases slightly. However, the performance of motion generation improves. This illustrates the necessity of balancing different motion generation tasks to maintain optimal performance.
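The robustness argument for Huber loss can be seen numerically. The sketch below compares the element-wise Huber penalty with the squared error on a small and a large residual. The threshold `delta = 1.0` is an assumption, since the paper does not state the value used.

```python
import numpy as np

def huber(residual, delta=1.0):
    """Huber penalty: quadratic for |r| <= delta, linear beyond it."""
    abs_r = np.abs(residual)
    quadratic = 0.5 * residual ** 2
    linear = delta * (abs_r - 0.5 * delta)
    return np.where(abs_r <= delta, quadratic, linear)

residuals = np.array([0.5, 3.0])       # a typical error and an outlier
huber_values = huber(residuals)        # [0.125, 2.5]
squared_values = residuals ** 2        # [0.25, 9.0]
```

The outlier contributes 2.5 under Huber but 9.0 under the squared error, so a few extreme frames influence the gradient far less.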

4 Conclusions

In this paper, we presented FreeTalker, a simple yet effective framework for generating both spontaneous and non-spontaneous speaker motions. Leveraging a diffusion-based model, our method is trained on heterogeneous data sourced from various motion datasets. The incorporation of classifier-free guidance and DoubleTake during the inference stage enables natural, highly controllable, and long-range motion generation. Moreover, our approach lays the foundation for future work on large-scale motion datasets and more sophisticated models, paving the way for further advancements in speaker motion generation and enhancing talking avatars’ naturalness in various applications.

We intend to elaborate on extending our work to the generation of fully digital humans, encompassing motions, facial expressions, and lip movements. We also aim to explore a more unified approach to digital human generation.

Acknowledgments

This work is supported by National Natural Science Foundation of China (62076144), Shenzhen Key Laboratory of next generation interactive media innovative technology (ZDSYS20210623092001004), Shenzhen Science and Technology Program (WDZC20220816140515001, JCYJ20220818101014030) and Tencent AI Lab Rhino-Bird Focused Research Program (RBFR2023015).

References

  • [1] Simbarashe Nyatsanga, Taras Kucherenko, Chaitanya Ahuja, et al., “A comprehensive review of data-driven co-speech gesture generation,” in Computer Graphics Forum, 2023, vol. 42, pp. 569–596.
  • [2] Taras Kucherenko, Rajmund Nagy, Youngwoo Yoon, et al., “The genea challenge 2023: A large scale evaluation of gesture generation models in monadic and dyadic settings,” arXiv preprint arXiv:2308.12646, 2023.
  • [3] Haolin Zhuang, Shun Lei, Long Xiao, et al., “Gtn-bailando: Genre consistent long-term 3d dance generation based on pre-trained genre token network,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023, pp. 1–5.
  • [4] Zunnan Xu, Yachao Zhang, Sicheng Yang, Ronghui Li, and Xiu Li, “Chain of generation: Multi-modal gesture synthesis via cascaded conditional control,” arXiv preprint arXiv:2312.15900, 2023.
  • [5] Wentao Zhu, Xiaoxuan Ma, Dongwoo Ro, et al., “Human motion generation: A survey,” arXiv preprint arXiv:2307.10894, 2023.
  • [6] Saeed Ghorbani, Ylva Ferstl, Daniel Holden, et al., “Zeroeggs: Zero-shot example-based gesture generation from speech,” in Computer Graphics Forum, 2023.
  • [7] Simon Alexanderson, Rajmund Nagy, Jonas Beskow, et al., “Listen, denoise, action! audio-driven motion synthesis with diffusion models,” ACM Transactions on Graphics (TOG), vol. 42, no. 4, pp. 1–20, 2023.
  • [8] Sicheng Yang, Zhiyong Wu, Minglei Li, et al., “Diffusestylegesture: Stylized audio-driven co-speech gesture generation with diffusion models,” International Joint Conference on Artificial Intelligence, 2023.
  • [9] Tenglong Ao, Zeyi Zhang, and Libin Liu, “Gesturediffuclip: Gesture diffusion model with clip latents,” ACM Trans. Graph., 2023.
  • [10] Hanyang Kong, Kehong Gong, Dongze Lian, et al., “Priority-centric human motion generation in discrete latent space,” arXiv preprint arXiv:2308.14480, 2023.
  • [11] Guy Tevet, Sigal Raab, Brian Gordon, et al., “Human motion diffusion model,” in The Eleventh International Conference on Learning Representations, 2023.
  • [12] Mingyuan Zhang, Zhongang Cai, et al., “Motiondiffuse: Text-driven human motion generation with diffusion model,” arXiv preprint arXiv:2208.15001, 2022.
  • [13] Yonatan Shafir, Guy Tevet, Roy Kapon, and Amit H Bermano, “Human motion diffusion as a generative prior,” arXiv preprint arXiv:2303.01418, 2023.
  • [14] Jianxin Ma, Shuai Bai, and Chang Zhou, “Pretrained diffusion models for unified human motion synthesis,” arXiv preprint arXiv:2212.02837, 2022.
  • [15] Kfir Aberman, Peizhuo Li, Dani Lischinski, et al., “Skeleton-aware networks for deep motion retargeting,” ACM Transactions on Graphics (TOG), vol. 39, no. 4, pp. 62–1, 2020.
  • [16] Zixiang Zhou and Baoyuan Wang, “Ude: A unified driving engine for human motion generation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 5632–5641.
  • [17] Sicheng Yang, Zilin Wang, et al., “Unifiedgesture: A unified gesture synthesis model for multiple skeletons,” ACM International Conference on Multimedia, 2023.
  • [18] Jonathan Ho, Ajay Jain, and Pieter Abbeel, “Denoising diffusion probabilistic models,” Advances in neural information processing systems, pp. 6840–6851, 2020.
  • [19] Jonathan Ho and Tim Salimans, “Classifier-free diffusion guidance,” arXiv preprint arXiv:2207.12598, 2022.
  • [20] Georgios Pavlakos, Vasileios Choutas, Nima Ghorbani, et al., “Expressive body capture: 3d hands, face, and body from a single image,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 10975–10985.
  • [21] Chuan Guo, Shihao Zou, Xinxin Zuo, et al., “Generating diverse and natural 3d human motions from text,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 5152–5161.
  • [22] Alec Radford, Jong Wook Kim, Chris Hallacy, et al., “Learning transferable visual models from natural language supervision,” in International conference on machine learning. PMLR, 2021, pp. 8748–8763.
  • [23] Sicheng Yang, Haiwei Xue, Zhensong Zhang, et al., “The diffusestylegesture+ entry to the genea challenge 2023,” arXiv preprint arXiv:2308.13879, 2023.
  • [24] Sanyuan Chen, Chengyi Wang, Zhengyang Chen, et al., “Wavlm: Large-scale self-supervised pre-training for full stack speech processing,” IEEE Journal of Selected Topics in Signal Processing, 2022.
  • [25] Ashish Vaswani, Noam Shazeer, Niki Parmar, et al., “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017.
  • [26] Peter J Huber, “Robust estimation of a location parameter,” in Breakthroughs in statistics: Methodology and distribution, pp. 492–518. Springer, 1992.
  • [27] Jonathan Tseng, Rodrigo Castellon, and Karen Liu, “Edge: Editable dance generation from music,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 448–458.
  • [28] Haiyang Liu, Zihao Zhu, Naoya Iwamoto, et al., “Beat: A large-scale semantic and emotional multi-modal dataset for conversational gestures synthesis,” in European Conference on Computer Vision, 2022.
  • [29] Taras Kucherenko, Dai Hasegawa, Gustav Eje Henter, Naoshi Kaneko, and Hedvig Kjellström, “Analyzing input and output representations for speech-driven gesture generation,” in Proceedings of the 19th ACM International Conference on Intelligent Virtual Agents, 2019, pp. 97–104.
  • [30] Youngwoo Yoon, Bok Cha, Joo-Haeng Lee, et al., “Speech gesture generation from the trimodal context of text, audio, and speaker identity,” ACM Transactions on Graphics (TOG), 2020.
  • [31] Alain Hore and Djemel Ziou, “Image quality metrics: Psnr vs. ssim,” in 2010 20th international conference on pattern recognition. IEEE, 2010.
  • [32] Aurko Roy, Mohammad Saffar, Ashish Vaswani, et al., “Efficient content-based sparse attention with routing transformers,” Transactions of the Association for Computational Linguistics, vol. 9, pp. 53–68, 2021.
  • [33] Nikita Kitaev, Lukasz Kaiser, and Anselm Levskaya, “Reformer: The efficient transformer,” in International Conference on Learning Representations, 2020.