License: arXiv.org perpetual non-exclusive license
arXiv:2312.15185v1 [cs.CL] 23 Dec 2023

emotion2vec: Self-Supervised Pre-Training
for Speech Emotion Representation

Ziyang Ma1, Zhisheng Zheng1, Jiaxin Ye2, **chao Li3,
Zhifu Gao4, Shiliang Zhang4, Xie Chen1 (corresponding author)
1 Shanghai Jiao Tong University, 2 Fudan University,
3 The Chinese University of Hong Kong, 4 Alibaba
Abstract

We propose emotion2vec, a universal speech emotion representation model. emotion2vec is pre-trained on open-source unlabeled emotion data through self-supervised online distillation, combining utterance-level loss and frame-level loss during pre-training. emotion2vec outperforms state-of-the-art pre-trained universal models and emotion specialist models by only training linear layers for the speech emotion recognition task on the mainstream IEMOCAP dataset. In addition, emotion2vec shows consistent improvements among 10 different languages of speech emotion recognition datasets. emotion2vec also shows excellent results on other emotion tasks, such as song emotion recognition, emotion prediction in conversation, and sentiment analysis. Comparison experiments, ablation experiments, and visualization comprehensively demonstrate the universal capability of the proposed emotion2vec. To the best of our knowledge, emotion2vec is the first universal representation model in various emotion-related tasks, filling a gap in the field. Code, checkpoints, and extracted features are available at https://github.com/ddlBoJack/emotion2vec

1 Introduction

Extracting emotional representations from speech is an essential step in various emotional tasks such as speech emotion recognition (SER) and sentiment analysis. Traditional methods employ Filter Banks (FBanks) or Mel-Frequency Cepstral Coefficients (MFCCs) as speech features. These features are not rich in semantic information, resulting in limited performance on emotional tasks. Popular methods utilize features extracted from speech-based self-supervised learning (SSL) pre-trained models, leading to a significant performance improvement.

One potential challenge blocking further performance improvement is that these SSL models are not entirely suitable for emotional tasks. Wang et al. (2021) explore no fine-tuning, partial fine-tuning, and entire fine-tuning with some SSL models for SER on the IEMOCAP dataset Busso et al. (2008), and give some empirical conclusions. This remains an ad-hoc solution: on the one hand, fine-tuning SSL models incurs a large computational cost; on the other hand, the conclusions may be data-specific or model-constrained. Recently, Chen et al. (2023a) proposed an SER model named Vesper, which is obtained by model distillation from WavLM-large Chen et al. (2022) with emotion data. Vesper is designed for the SER task, and its universal representation capability remains to be demonstrated. Accordingly, a universal speech-based emotion representation model is urgently needed in the field.

Here we propose emotion2vec, a universal emotion representation model that can be used to extract speech features for diverse emotion tasks. Self-supervised pre-training is performed on 262 hours of open-source emotion data with an online distillation paradigm to obtain emotion2vec. Considering that both utterance-wide information and local details convey emotion, we propose a pre-training strategy combining utterance-level loss and frame-level loss. On the mainstream IEMOCAP dataset, the downstream linear model trained with features extracted from emotion2vec outperforms all the mainstream SSL models and the latest specialist models. emotion2vec is tested on 13 datasets covering 10 languages, and the results show that emotion2vec exhibits language generalization ability. Moreover, in addition to the SER task, we also experiment with emotion2vec features on song emotion recognition, emotion prediction in conversation, and sentiment analysis. The results indicate that emotion2vec has excellent task generalization ability. Extensive ablation experiments and visualization analysis demonstrate the effectiveness of our pre-training methods and the versatility of the proposed emotion2vec model.

2 Related Work

2.1 Speech-based SSL

Self-supervised learning has achieved remarkable success in the field of representation learning, showcasing its efficacy across natural language processing Devlin et al. (2019); Liu et al. (2019); Radford et al. (2019); Brown et al. (2020), computer vision Grill et al. (2020); He et al. (2020); Bao et al. (2021); He et al. (2022), as well as speech processing Baevski et al. (2020); Hsu et al. (2021); Chen et al. (2022); Baevski et al. (2022). For speech representation learning, SSL models can be classified into two categories according to the self-supervised targets utilized during pre-training Ma et al. (2023b): 1) offline targets and 2) online targets. Models employing offline targets often require a well-trained teacher model before the pre-training stage to extract self-supervised targets. Representative models of this type are HuBERT Hsu et al. (2021) and WavLM Chen et al. (2022), which use K-means targets, and PBERT Wang et al. (2022) and MonoBERT & PolyBERT Ma et al. (2023c), which use phoneme-based targets. Models using online targets do not need a pre-trained teacher model in advance; instead, the teacher model is constantly updated during the pre-training phase, following an online distillation paradigm. Representative models of this type are data2vec Baevski et al. (2022) and data2vec 2.0 Baevski et al. (2023), which use a frame-level masked language modeling (MLM) loss, and CA-DINO Han et al. (2023), which uses an utterance-level cross-entropy loss. emotion2vec is pre-trained by combining utterance-level loss and frame-level loss, leading to a superior speech emotion representation model.

2.2 Speech Emotion Representation

We present the first universal speech emotion representation model, whereas most previous works either directly employ speech pre-training models Pepino et al. (2021); Li et al. (2022), or fine-tune the pre-training models on their specific emotional data with specific emotional tasks (mostly SER) Morais et al. (2022); Chen and Rudnicky (2023), to extract speech emotion representations. A series of works investigate the SER performance of wav2vec 2.0 Wang et al. (2021), HuBERT Wang et al. (2021), and WavLM Ioannides et al. (2023), with or without fine-tuning. A recent work Ma et al. (2023a) found that data2vec features also have good representation ability in the SER task. For speech emotion representation in other emotion tasks, such as multimodal emotion recognition, popular practice Li et al. (2023a) is similar to what is mentioned above.

3 Methods

Here we introduce the self-supervised pre-training method of the proposed emotion2vec, the core of which is to train the model with an utterance-level loss and a frame-level loss under an online distillation paradigm.

Figure 1: The overall framework of emotion2vec. During the pre-training phase, emotion2vec conducts online distillation with a teacher network and a student network. When a specific downstream task is performed, emotion2vec is frozen and a lightweight downstream model is trained.

3.1 Model Pipeline

As shown in Figure 1, emotion2vec contains two networks in the pre-training phase: the teacher network $\mathcal{T}$ and the student network $\mathcal{S}$. Both networks share the same architecture, comprising a feature extractor $\mathcal{F}$ composed of multi-layer convolutional neural networks and a backbone network $\mathcal{B}$ composed of multi-layer Transformers. These modules can be configured with different architectures, which are described in Section 4.1. Given a raw audio utterance $X=[x_1,\cdots,x_{N_x}]$, the teacher $\mathcal{T}$ and the student $\mathcal{S}$ use their respective feature extractors $\mathcal{F}^{\mathcal{T}}$ and $\mathcal{F}^{\mathcal{S}}$ to obtain the downsampled features $Z=[z_1,\cdots,z_{N_z}]$, which can be written as:

$$Z^{\mathcal{T}}=\mathcal{F}^{\mathcal{T}}(X), \tag{1}$$
$$Z^{\mathcal{S}}=\mathcal{F}^{\mathcal{S}}(X). \tag{2}$$

For the teacher network $\mathcal{T}$, the downsampled features $Z^{\mathcal{T}}$ are fed directly into the backbone network $\mathcal{B}^{\mathcal{T}}$. For the student network $\mathcal{S}$, the downsampled features $Z^{\mathcal{S}}$ are masked: each frame is chosen as a span start with probability $p$, and the $l$ consecutive frames from that start are masked. A learnable utterance embedding $U=[u_1,\cdots,u_{N_u}]$ is then prepended before the sequence is fed into the backbone network $\mathcal{B}^{\mathcal{S}}$. This can be written as:

$$Y^{\mathcal{T}}=\frac{1}{k}\sum_{i=1}^{k}\mathcal{B}_i^{\mathcal{T}}(Z^{\mathcal{T}}), \tag{3}$$
$$U^{\mathcal{S}};Y^{\mathcal{S}}=\mathcal{B}^{\mathcal{S}}(U;Mask(Z^{\mathcal{S}})), \tag{4}$$

where $Y^{\mathcal{T}}$ is the average of the output embeddings of the top $k$ Transformer blocks $\mathcal{B}_i^{\mathcal{T}}$. The utterance-level output embedding $U^{\mathcal{S}}$ and the frame-level output embedding $Y^{\mathcal{S}}$ are the outputs of the student backbone network $\mathcal{B}^{\mathcal{S}}$, and $Mask$ is the masking operation. $Y^{\mathcal{T}}$, $Y^{\mathcal{S}}$ and $U^{\mathcal{S}}$ share the same hidden dimension; $Y^{\mathcal{T}}$ and $Y^{\mathcal{S}}$ have temporal length $N_z$, while $U^{\mathcal{S}}$ has temporal length $N_u$.
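The masking step applied to the student input can be sketched as follows. This is a minimal illustration of a literal reading of the description above, in which every frame is independently drawn as a span start with probability p. The function name `span_mask`, the seed handling, and the toy values in the usage lines are assumptions made for illustration; the released implementation may select span starts differently.

```python
import random


def span_mask(num_frames, p, span, seed=0):
    """Return a list of booleans; True marks a masked frame.

    Each frame is drawn as a span start with probability `p`, and the
    `span` frames beginning at that start are masked. Spans may overlap
    and are clipped at the end of the sequence.
    """
    rng = random.Random(seed)
    mask = [False] * num_frames
    for start in range(num_frames):
        if rng.random() < p:
            for i in range(start, min(start + span, num_frames)):
                mask[i] = True
    return mask


# Toy usage with illustrative (hypothetical) values.
mask = span_mask(num_frames=20, p=0.2, span=3, seed=0)
masked_ratio = sum(mask) / len(mask)
print(mask, masked_ratio)
```

Because spans overlap, the fraction of masked frames is generally larger than p and grows quickly with the span length.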

3.2 Utterance-level Loss

Utterance-level loss constructs an utterance-level pretext task to learn the global emotion. We use mean squared error (MSE) to calculate the loss, which can be written as:

$$L_{Utt}=(\bar{Y}^{\mathcal{T}}-\bar{U}^{\mathcal{S}})^{2}, \tag{5}$$

where

$$\bar{Y}^{\mathcal{T}}=\frac{1}{N_z}\sum_{i=1}^{N_z}Y_i^{\mathcal{T}}, \tag{6}$$
$$\bar{U}^{\mathcal{S}}=\frac{1}{N_u}\sum_{i=1}^{N_u}U_i^{\mathcal{S}}, \tag{7}$$

which means that the utterance-level loss $L_{Utt}$ is computed from the temporally pooled $Y^{\mathcal{T}}$ and $U^{\mathcal{S}}$. We propose three ways to compute the utterance-level loss, which we call token embedding, chunk embedding, and global embedding, as shown in Figure 2.

Token Embedding

Token embedding employs a single token to represent the global emotion information encoded by the student network $\mathcal{S}$. More explicitly, we set $N_u$ to 1 in the learnable utterance embedding $U=[u_1,\cdots,u_{N_u}]$.

Chunk Embedding

Chunk embedding employs multiple tokens to represent global emotion information. In this case, more global information can be aggregated within the chunk.

Global Embedding

With global embedding, no additional utterance tokens are added. We use temporal pooling of the frame-level output embedding $Y^{\mathcal{S}}$ instead of $U^{\mathcal{S}}$ to compute the loss.
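The three variants differ only in what is pooled on the student side, which the following sketch tries to make concrete. It uses plain Python lists in place of tensors; the helper names (`mean_pool`, `mse`, `utterance_loss`) are hypothetical, and averaging the squared difference over the hidden dimension is an assumption about how the square in Eq. (5) is reduced to a scalar.

```python
def mean_pool(frames):
    """Average a list of equal-length vectors over the temporal axis."""
    n = len(frames)
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / n for d in range(dim)]


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def utterance_loss(teacher_frames, student_utt=None, student_frames=None):
    """Utterance-level loss between pooled teacher and student outputs.

    Pass `student_utt` (N_u vectors) for token embedding (N_u = 1) or
    chunk embedding (N_u > 1). Pass `student_frames` instead for global
    embedding, where the student frame outputs are pooled directly.
    """
    target = mean_pool(teacher_frames)
    source = student_utt if student_utt is not None else student_frames
    return mse(target, mean_pool(source))


# Toy values with hidden dimension 2 (illustrative only).
teacher = [[1.0, 2.0], [3.0, 4.0]]                                  # pools to [2, 3]
token_loss = utterance_loss(teacher, student_utt=[[2.0, 3.0]])
chunk_loss = utterance_loss(teacher, student_utt=[[1.0, 3.0], [3.0, 5.0]])
global_loss = utterance_loss(teacher, student_frames=[[0.0, 3.0], [2.0, 3.0]])
```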

Figure 2: Different ways to compute utterance-level loss in emotion2vec pre-training.

3.3 Frame-level Loss

Frame-level loss constructs a frame-wise pretext task to learn the context emotion. We only compute the loss on the masked part, which is the common practice for a masked language modeling (MLM) pretext task. The frame-level loss $L_{Frm}$ can be expressed as:

$$L_{Frm}=\frac{1}{M}\sum_{i\in\mathbb{M}}(Y_i^{\mathcal{T}}-Y_i^{\mathcal{S}})^{2}, \tag{8}$$

where $\mathbb{M}$ denotes the index sequence of the masked frames of the frame-level output embedding $Y^{\mathcal{S}}$, and $M$ denotes the total number of masked tokens.
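A small sketch of Eq. (8) may help show that unmasked frames contribute nothing to the loss. The function name `frame_loss` and the toy inputs are hypothetical, and averaging the squared error over the hidden dimension within each frame is an assumption about the reduction.

```python
def frame_loss(teacher_frames, student_frames, mask):
    """Mean squared error over masked frames only.

    `mask[i]` is True when frame i was masked in the student input.
    Returns 0.0 when no frame is masked.
    """
    masked = [i for i, m in enumerate(mask) if m]
    if not masked:
        return 0.0
    total = 0.0
    for i in masked:
        t, s = teacher_frames[i], student_frames[i]
        total += sum((a - b) ** 2 for a, b in zip(t, s)) / len(t)
    return total / len(masked)


# Toy values: frame 0 differs but is unmasked, so it is ignored.
teacher = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
student = [[0.0, 0.0], [2.0, 4.0], [3.0, 3.0]]
loss = frame_loss(teacher, student, mask=[False, True, True])
```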

3.4 Online Distillation

Online distillation is a self-supervised learning strategy for teacher-student learning, where the student network updates its parameters by backpropagation and the teacher network updates its parameters with an exponential moving average (EMA) (Grill et al., 2020). For the student network $\mathcal{S}$, the total loss $L$ for backpropagation is a combination of the frame-level loss $L_{Frm}$ and the utterance-level loss $L_{Utt}$, denoted as:

$$L=L_{Frm}+\alpha L_{Utt}, \tag{9}$$

with a tunable weight $\alpha$. For the teacher network $\mathcal{T}$, the parameters $\theta_0^{\mathcal{T}}$ are initialized to the parameters of the student network $\theta_0^{\mathcal{S}}$, and are then updated with EMA within each mini-batch, denoted as:

$$\theta_{t+1}^{\mathcal{T}}=\tau\theta_t^{\mathcal{T}}+(1-\tau)\theta_{t+1}^{\mathcal{S}}, \tag{10}$$

where $\tau$ is a parameter that increases linearly during pre-training. In practice, within each mini-batch the parameters of the teacher feature extractor $\mathcal{F}^{\mathcal{T}}$ are copied directly from $\mathcal{F}^{\mathcal{S}}$, while the parameters of the teacher backbone network $\mathcal{B}^{\mathcal{T}}$ are updated with EMA from $\mathcal{B}^{\mathcal{T}}$ and $\mathcal{B}^{\mathcal{S}}$.
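The update rules in Eq. (9) and Eq. (10) can be sketched with scalar parameters standing in for weight tensors. All function names here are hypothetical, and the linear schedule is written per optimizer step, which is an assumption about the granularity at which $\tau$ is increased.

```python
def total_loss(l_frm, l_utt, alpha):
    """Combine frame-level and utterance-level losses as in Eq. (9)."""
    return l_frm + alpha * l_utt


def tau_at(step, total_steps, tau_start, tau_end):
    """Linearly interpolate the EMA decay from tau_start to tau_end."""
    frac = min(step / total_steps, 1.0)
    return tau_start + (tau_end - tau_start) * frac


def ema_update(teacher, student, tau):
    """Return the teacher parameters after one EMA step, as in Eq. (10)."""
    return {name: tau * teacher[name] + (1.0 - tau) * student[name]
            for name in teacher}


# Toy usage: one scalar parameter per network.
teacher = {"w": 1.0}
student = {"w": 0.0}
teacher = ema_update(teacher, student, tau=0.9)   # w moves slightly toward the student
```

With $\tau$ close to 1 the teacher changes slowly, so it acts as a smoothed copy of recent student weights rather than a mirror of the latest step.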

4 Experimental Setup

4.1 Initial Model

Different initial models lead to different architectures of the feature extractor $\mathcal{F}$ and the backbone network $\mathcal{B}$, and to different initialization parameters $\theta_0$. Here we adopt two models, data2vec (https://dl.fbaipublicfiles.com/fairseq/data2vec/audio_base_ls.pt) and data2vec 2.0 (https://dl.fbaipublicfiles.com/fairseq/data2vec2/base_libri.pt), which have the same feature extractor design but different backbone network designs. The feature extractor $\mathcal{F}$ is a 7-layer 1-D convolutional neural network with kernel sizes $(10,3,3,3,3,2,2)$ and strides $(5,2,2,2,2,2,2)$, resulting in 320x downsampling. Given the raw audio input $X$ at a 16000 Hz sample rate, the output representations $Z$ are at 50 Hz with dimension 512. A linear projection from dimension 512 to 768 is then applied, followed by the mask operation, to construct the input for the backbone network $\mathcal{B}$. Here we briefly introduce the different backbone networks in data2vec and data2vec 2.0.
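The downsampling figures can be checked with a few lines of arithmetic. The sketch below assumes the wav2vec 2.0-style layout with kernel widths (10, 3, 3, 3, 3, 2, 2) and strides (5, 2, 2, 2, 2, 2, 2), which is the assignment consistent with the stated 320x downsampling and 50 Hz output, and it assumes convolutions without padding. The function name `num_frames` is hypothetical.

```python
from math import prod

KERNELS = (10, 3, 3, 3, 3, 2, 2)   # assumed kernel widths per layer
STRIDES = (5, 2, 2, 2, 2, 2, 2)    # assumed strides per layer
SAMPLE_RATE = 16000                # Hz, as stated in the text

downsampling = prod(STRIDES)               # overall stride of the stack
frame_rate = SAMPLE_RATE / downsampling    # output frames per second


def num_frames(num_samples):
    """Output length of the conv stack for an unpadded 1-D input."""
    n = num_samples
    for k, s in zip(KERNELS, STRIDES):
        n = (n - k) // s + 1
    return n


print(downsampling, frame_rate, num_frames(SAMPLE_RATE))
```

Under these assumptions one second of audio yields slightly fewer frames than the nominal frame rate, because the unpadded kernels consume samples at the edges.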

data2vec

The backbone network $\mathcal{B}$ contains a 5-layer learnable convolutional positional encoding followed by a 12-layer standard Transformer. Each Transformer block has a model dimension of 768, a bottleneck dimension of 3072, and 12 attention heads. Finally, a linear projection from 768 to 768 is applied to the student outputs, and the results are used to calculate the MLM loss against the teacher outputs.

data2vec 2.0

The data2vec 2.0 model shares the same Transformer architecture as data2vec, except for an additional CNN decoder. The Transformer encoder only encodes the non-masked parts of the downsampled features $Z$; the masked parts are then filled with random Gaussian noise before being passed to the CNN decoder, in an MAE-style fashion, to improve efficiency. The CNN decoder is a 4-layer 1-D convolutional neural network with all kernel sizes set to 7, strides set to 1, and channels set to 384, without downsampling. A linear projection from 384 to 768 is applied to compute the MLM loss, which works the same way as in data2vec.

4.2 Training Details

Self-supervised Pre-training

In the pre-training phase, we train emotion2vec with the 262 hours of unlabeled emotion data shown in Table 1, using different initial models. For the training overhead, the pre-training is conducted on 4 NVIDIA A10 Tensor Core GPUs, and we simulate 16 GPUs by setting the update frequency to 4. We train emotion2vec for 100 epochs, each of which takes about 37 minutes. We use a dynamic batch size, where the maximum number of tokens is $1\times10^{6}$. For the optimization strategy, we use Adam with a learning rate of $7.5\times10^{-5}$ and a weight decay of $1\times10^{-2}$. We train emotion2vec using a cosine learning rate scheduler with a 5% linear warm-up. For the student model, each time step of the input has a probability of $p=0.5$ of being a start index, and the subsequent $l=5$ time steps are masked. The hyperparameter $\alpha$ that controls the loss weight is set to 1. For the teacher model, we use the average of the outputs of the top $k=8$ Transformer blocks as the training targets. We apply a linearly increasing schedule for $\tau$, from $\tau_s=0.999$ to $\tau_e=0.99999$, for the exponential moving average of the teacher parameters.
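As an illustration of the schedule described above, the following sketch computes a cosine learning rate with a 5% linear warm-up, together with the bookkeeping implied by the stated GPU and epoch numbers. The function name `lr_at` is hypothetical, and two details are assumptions rather than facts from the text: the warm-up is measured in optimizer steps, and the cosine phase decays to zero.

```python
import math

PEAK_LR = 7.5e-5      # learning rate stated in the text
WARMUP_FRAC = 0.05    # 5% linear warm-up stated in the text


def lr_at(step, total_steps, peak_lr=PEAK_LR, warmup_frac=WARMUP_FRAC):
    """Learning rate at `step`: linear warm-up, then cosine decay to zero."""
    warmup_steps = round(total_steps * warmup_frac)
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


# Bookkeeping implied by the numbers in the text.
simulated_gpus = 4 * 4            # 4 physical GPUs x update frequency 4
total_hours = 100 * 37 / 60       # 100 epochs at about 37 minutes each
```

Under these numbers the full pre-training run takes roughly 62 hours of wall-clock time on the 4 GPUs.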

Supervised Fine-tuning

All model architectures of diverse downstream tasks are designed to be as simple as possible, to demonstrate the representation ability of the pretrained model. For the non-sequential task, following the common practice of SUPERB Yang et al. (2021), we use two linear layers with a ReLU activation function sandwiched between them. For the sequential task, we use two layers of gated recurrent units (GRU) to make predictions.

4.3 Datasets

Table 1: The datasets at a glance for emotion2vec pre-training and downstream tasks.

| Dataset | Pretrain | Downstream | Source | Emo | Spk | Lang | #Utts | #Hours |
|---|---|---|---|---|---|---|---|---|
| IEMOCAP Busso et al. (2008) | ✓ | ✓ | Act | 5 | 10 | English | 5531 | 7.0 |
| MELD Poria et al. (2019) | ✓ | ✓ | Friends TV | 7 | 407 | English | 13847 | 12.2 |
| CMU-MOSEI Zadeh et al. (2018) | ✓ | ✓ | YouTube | 7 | 1000 | English | 44977 | 91.9 |
| MEAD Wang et al. (2020) | ✓ | ✗ | Act | 8 | 60 | English | 31792 | 37.3 |
| MSP-Podcast (V1.8) Martinez-Lucas et al. (2020) | ✓ | ✗ | Podcast | 8 | 10000+ | English | 72969 | 113.5 |
| Total | | | | | | English | 169053 | 262.0 |
| CMU-MOSI Zadeh et al. (2016) | ✗ | ✓ | YouTube | 7 | 89 | English | 2199 | 2.6 |
| RAVDESS-Speech Livingstone and Russo (2018) | ✗ | ✓ | Act | 8 | 24 | English | 1440 | 1.5 |
| RAVDESS-Song Livingstone and Russo (2018) | ✗ | ✓ | Act | 8 | 23 | English | 1012 | 1.3 |
| SAVEE Jackson and Haq (2014) | ✗ | ✓ | Act | 7 | 4 | English | 480 | 0.5 |
| M3ED Zhao et al. (2022) | ✗ | ✓ | TVs | 7 | 626 | Mandarin | 24449 | 9.8 |
| EmoDB Burkhardt et al. (2005) | ✗ | ✓ | Act | 7 | 10 | German | 535 | 0.4 |
| EMOVO Costantini et al. (2014) | ✗ | ✓ | Act | 7 | 10 | Italian | 588 | 0.5 |
| CaFE Gournay et al. (2018) | ✗ | ✓ | Act | 7 | 12 | French | 936 | 1.2 |
| SUBESCO Sultana et al. (2021) | ✗ | ✓ | Act | 7 | 20 | Bangla | 7000 | 7.8 |
| ShEMO Mohamad Nezami et al. (2019) | ✗ | ✓ | Act | 6 | 87 | Persian | 3000 | 3.4 |
| URDU Latif et al. (2018) | ✗ | ✓ | Talk shows | 4 | 38 | Urdu | 400 | 0.3 |
| AESDD Vryzas et al. (2018) | ✗ | ✓ | Act | 5 | 5 | Greek | 604 | 0.7 |
| RESD Lubenets et al. | ✗ | ✓ | Act | 7 | 200 | Russian | 1396 | 2.3 |

A summary of the datasets employed in our experiments is presented in Table 1. There are 18 emotional datasets covering 10 different languages: 9 in English, and one each in Mandarin, Bangla, French, German, Greek, Italian, Persian, Russian, and Urdu. Each dataset is characterized in terms of Pretrain (i.e., whether it is used during the pre-training phase), Downstream (i.e., whether it is tested in a downstream task), Source (i.e., where the samples were collected), Emo (i.e., number of emotion categories), Spk (i.e., number of speakers), Lang (i.e., language), #Utts (i.e., number of utterances), and #Hours (i.e., total duration of samples). Speech data is extracted from these datasets and uniformly processed into single-channel audio at 16 kHz.

In the pretraining phase, we utilize five large-scale English datasets, including IEMOCAP Busso et al. (2008), MELD Poria et al. (2019), MEAD Wang et al. (2020), CMU-MOSEI Zadeh et al. (2018), and MSP-Podcast Martinez-Lucas et al. (2020), resulting in a total of 262 hours. The IEMOCAP corpus contains a total of 5 sessions and 10 different speakers, with each session being a conversation of two exclusive speakers. MELD is a multi-party conversational dataset containing about 13,847 utterances from 1,433 dialogues collected from the TV series ‘Friends’. MEAD is a talking-face video corpus featuring 60 actors and actresses talking with 8 different emotions at three different intensity levels. CMU-MOSEI is a multimodal dataset from YouTube for sentiment and emotion analysis in videos. MSP-Podcast is collected from podcast recordings that discuss a variety of topics like politics, sports, and movies.

Different datasets are used to test different downstream tasks in various languages. For the main results in Section 5.2, we report cross-validation (CV) results on the IEMOCAP dataset. The original labels cover five classes; to be consistent and comparable with previous methods Ye et al. (2023); Chen et al. (2023b), we merge ‘excited’ with ‘happy’ to better balance the size of each emotion class, resulting in four classes. We conduct both leave-one-session-out 5-fold CV and leave-one-speaker-out 10-fold CV. Moreover, we report results on MELD under its original split setup, and on the RAVDESS Livingstone and Russo (2018) and SAVEE Jackson and Haq (2014) datasets under a random leave-one-out 10-fold CV setup, in which, at each fold, all samples within the dataset are randomly split into training, validation, and testing sets containing 80%, 10%, and 10% of the samples, respectively. Speech in the RAVDESS and SAVEE datasets is not seen in the pre-training stage, so these results demonstrate the generalization of the proposed model to out-of-domain corpora.
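The two kinds of split described above can be sketched as follows. This is an illustrative reading only: the function names are hypothetical, the group labels are toy values, and using the fold index as the shuffle seed is an assumption about how the ten random splits are made distinct.

```python
import random


def leave_one_group_out(groups):
    """Yield (held_out, train_idx, test_idx) for each distinct group.

    `groups[i]` is the session (or speaker) label of sample i.
    """
    for held_out in sorted(set(groups)):
        train = [i for i, g in enumerate(groups) if g != held_out]
        test = [i for i, g in enumerate(groups) if g == held_out]
        yield held_out, train, test


def random_split(num_samples, seed):
    """Shuffle sample indices and split them 80% / 10% / 10%."""
    idx = list(range(num_samples))
    random.Random(seed).shuffle(idx)
    n_train = num_samples * 8 // 10
    n_val = num_samples // 10
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]


# Leave-one-session-out on a toy list of session labels.
folds = list(leave_one_group_out([1, 1, 2, 2, 3]))

# Ten random 80/10/10 splits of a 480-utterance dataset (the size listed for SAVEE).
splits = [random_split(480, seed=fold) for fold in range(10)]
```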

For the language generalization task in Section 5.3, we report CV results for 9 out-of-domain datasets, one each in Mandarin (M3ED Zhao et al. (2022)), Bangla (SUBESCO Sultana et al. (2021)), French (CaFE Gournay et al. (2018)), German (EmoDB Burkhardt et al. (2005)), Greek (AESDD Vryzas et al. (2018)), Italian (EMOVO Costantini et al. (2014)), Persian (ShEMO Mohamad Nezami et al. (2019)), Russian (RESD Lubenets et al.), and Urdu (URDU Latif et al. (2018)). Unless the dataset provides its own partition, language generalization results are obtained using the random leave-one-out 10-fold CV described above. For the RESD dataset, for example, we follow its original split setup with 280 testing samples and 1116 training samples, and additionally allocate 10% of the training samples for validation and use the rest for training.

For the task generalization experiments in Section 5.4, we test other speech emotion tasks, including song emotion recognition, emotion prediction in conversation, and sentiment analysis, on RAVDESS-Song Livingstone and Russo (2018), IEMOCAP, and CMU-MOSI Zadeh et al. (2016) & CMU-MOSEI Zadeh et al. (2018), respectively. For song emotion recognition and emotion prediction in conversation, we report CV results. For sentiment analysis, we report results with the original split setup. To be comparable with previous work, the experimental setup varies according to the specific task.

5 Results

5.1 Evaluation Metrics

We apply commonly used evaluation metrics, weighted accuracy (WA), unweighted accuracy (UA), and weighted average F1 (WF1), to evaluate the performance of speech emotion tasks. WA corresponds to the overall accuracy and UA corresponds to the average class-wise accuracy. WF1 is a comprehensive evaluation, especially for the situation of sample imbalance.
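The three metrics can be computed from label lists as in the sketch below. The function name `ser_metrics` and the toy labels are hypothetical; per-class F1 is weighted by class support, and classes are taken from the ground-truth labels, which are assumptions about conventions the text does not spell out.

```python
from collections import Counter


def ser_metrics(y_true, y_pred):
    """Return (WA, UA, WF1) for two equal-length label lists."""
    labels = sorted(set(y_true))
    n = len(y_true)
    support = Counter(y_true)                                   # samples per true class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    predicted = Counter(y_pred)                                 # samples per predicted class

    wa = sum(correct.values()) / n                              # overall accuracy
    ua = sum(correct[c] / support[c] for c in labels) / len(labels)  # mean per-class recall

    wf1 = 0.0
    for c in labels:
        precision = correct[c] / predicted[c] if predicted[c] else 0.0
        recall = correct[c] / support[c]
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        wf1 += f1 * support[c] / n                              # weight F1 by class support
    return wa, ua, wf1


# Toy imbalanced example: three 'hap' samples and one 'sad' sample.
wa, ua, wf1 = ser_metrics(["hap", "hap", "hap", "sad"],
                          ["hap", "hap", "sad", "sad"])
```

In this toy case WA is 0.75 while UA is higher, because the single minority-class sample is classified correctly and counts as much as the whole majority class.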

5.2 Main Results

The results are shown in Table 2, where we compare different SSL pre-trained models on the IEMOCAP dataset, as well as larger-scale pre-trained models and the latest specialist models designed for SER tasks. We follow the evaluation of SUPERB Yang et al. (2021), freezing the pre-trained model and training downstream linear layers with the hidden dimension set to 256. As can be seen from the table, emotion2vec outperforms all existing SSL pre-trained models, including base models with a similar number of parameters and large models with more parameters. Compared with Vesper-12, an SER model obtained by distillation from WavLM-large, emotion2vec works better with fewer parameters. TIM-Net Ye et al. (2023), MSTR Li et al. (2023b), and DST Chen et al. (2023b) are the latest SER specialist models, which use different scales of upstream features and downstream networks. The proposed emotion2vec model outperforms or performs on par with these models using only linear layers, while their downstream networks have 2x, 135x, and 114x as many parameters as that of emotion2vec, respectively. We provide the results of leave-one-session-out five-fold cross-validation and leave-one-speaker-out ten-fold cross-validation for reference.
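
The downstream parameter counts in this comparison can be sanity-checked with simple arithmetic. The sketch below assumes a SUPERB-style head consisting of one linear projection to the 256-dimensional hidden layer followed by a linear classifier over four classes, with parameter-free pooling in between. The feature dimensions 512, 768, and 1024 are assumed values for small, base, and large upstream models and are not stated in this section.

```python
def linear_params(d_in, d_out):
    # Weights plus biases of a fully connected layer.
    return d_in * d_out + d_out


def head_params(feature_dim, hidden_dim=256, num_classes=4):
    # Assumed head: Linear(feature_dim, 256) -> pooling -> Linear(256, 4).
    return linear_params(feature_dim, hidden_dim) + linear_params(hidden_dim, num_classes)


# Assumed upstream feature dimensions (not given in the text).
head_m = {dim: round(head_params(dim) / 1e6, 2) for dim in (512, 768, 1024)}

# Downstream sizes in millions of parameters, as reported for the specialist models.
emotion2vec_head = 0.20
ratios = {name: round(size / emotion2vec_head)
          for name, size in [("TIM-Net", 0.40), ("MSTR", 27.00), ("DST", 22.78)]}
```

Under these assumptions the head has roughly 0.13M, 0.20M, and 0.26M parameters for the three feature dimensions, and the ratios round to 2, 135, and 114.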

We also conduct experiments on other mainstream English datasets to demonstrate the generalization of emotion2vec, as shown in Table 3. MELD is a noisy dataset used to test the SER performance of the model in complex environments. RAVDESS and SAVEE are out-of-domain datasets, each with its own recording environment. Experimental results show that emotion2vec exhibits state-of-the-art performance on different datasets in different environments.

Table 2: SER task performance of different SSL pre-trained models on the IEMOCAP dataset. The setting of the downstream models follows SUPERB Yang et al. (2021), using linear layers to test the representation ability of different upstream models. “LS-960” means LibriSpeech 960 hours, “LL-60k” means LibriLight 60k hours, and “Mix-94k” means 94k hours of data including LibriLight, VoxPopuli, and GigaSpeech. For emotion data, “LSED-206” means LSED 206 hours, and “Emo-262” refers to the 262 hours of pre-training data in Table 1. Models are tested using leave-one-session-out five-fold cross-validation, with 20% of the training set used as the validation set for each session. Underlined models are tested using leave-one-speaker-out ten-fold cross-validation, with 8 speakers for training, 1 speaker for validation, and 1 speaker for testing within each fold. Models marked with * use the same fold for both validation and testing, for a fair comparison, as some work follows this practice. We also compare with larger-scale pre-trained models and the latest specialist models designed for SER tasks.

| Model | Pre-training Corpus | Upstream | #Upstream Params | Downstream | #Downstream Params | WA(%) ↑ |
|---|---|---|---|---|---|---|
| Self-supervised Model | | | | | | |
| small size | | | | | | |
| wav2vec (Schneider et al., 2019) | LS-960 | Proposed | 32.54M | Linear | 0.13M | 59.79 |
| vq-wav2vec (Baevski et al., 2019) | LS-960 | Proposed | 34.15M | Linear | 0.20M | 58.24 |
| base size | | | | | | |
| wav2vec 2.0 (Baevski et al., 2020) | LS-960 | Proposed | 95.04M | Linear | 0.20M | 63.43 |
| HuBERT (Hsu et al., 2021) | LS-960 | Proposed | 94.68M | Linear | 0.20M | 64.92 |
| WavLM (Chen et al., 2022) | LS-960 | Proposed | 94.70M | Linear | 0.20M | 65.94 |
| WavLM+ (Chen et al., 2022) | Mix-94k | Proposed | 94.70M | Linear | 0.20M | 67.98 |
| data2vec (Baevski et al., 2022) | LS-960 | Proposed | 93.75M | Linear | 0.20M | 67.38 |
| data2vec 2.0 (Baevski et al., 2023) | LS-960 | Proposed | 93.78M | Linear | 0.20M | 68.58 |
| Vesper-4 (Chen et al., 2023a) | Mix-94k + LSED-206 | Proposed | 63.52M | Linear | 0.26M | 68.40 |
| Vesper-12 (Chen et al., 2023a) | Mix-94k + LSED-206 | Proposed | 164.29M | Linear | 0.26M | 70.70 |
| emotion2vec | LS-960 + Emo-262 | Proposed | 93.79M | Linear | 0.20M | 71.79 |
| emotion2vec* | LS-960 + Emo-262 | Proposed | 93.79M | Linear | 0.20M | 74.48 |
| emotion2vec | LS-960 + Emo-262 | Proposed | 93.79M | Linear | 0.20M | 72.94 |
| emotion2vec* | LS-960 + Emo-262 | Proposed | 93.79M | Linear | 0.20M | 77.64 |
| large size | | | | | | |
| wav2vec 2.0 (Baevski et al., 2020) | LL-60k | Proposed | 317.38M | Linear | 0.26M | 65.64 |
| HuBERT (Hsu et al., 2021) | LL-60k | Proposed | 316.61M | Linear | 0.26M | 67.62 |
| WavLM (Chen et al., 2022) | Mix-94k | Proposed | 316.62M | Linear | 0.26M | 70.03 |
| Supervised Model | | | | | | |
| TIM-Net (Ye et al., 2023) | - | MFCC | - | CNN(TIM-Net) | 0.40M | 68.29 |
| MSTR (Li et al., 2023b) | - | HuBERT-large | 316.61M | Transformer(MSTR) | 27.00M | 70.03 |
| DST (Chen et al., 2023b) | - | WavLM-large | 316.62M | Transformer(DST) | 22.78M | 71.80 |

Table 3: emotion2vec performance on mainstream English datasets.

| Model | MELD WA(%) ↑ | MELD UA(%) ↑ | MELD WF1(%) ↑ | RAVDESS WA(%) ↑ | RAVDESS UA(%) ↑ | RAVDESS WF1(%) ↑ | SAVEE WA(%) ↑ | SAVEE UA(%) ↑ | SAVEE WF1(%) ↑ |
|---|---|---|---|---|---|---|---|---|---|
| WavLM-base | 46.95 | 16.34 | 35.16 | 37.01 | 37.11 | 36.08 | 42.08 | 38.46 | 38.93 |
| WavLM-base+ | 43.78 | 16.75 | 34.60 | 38.89 | 38.40 | 37.75 | 43.54 | 39.27 | 42.19 |
| data2vec | 45.75 | 24.98 | 43.59 | 69.58 | 69.70 | 69.25 | 82.50 | 82.26 | 82.37 |
| data2vec 2.0 | 48.92 | 26.10 | 45.80 | 81.04 | 80.80 | 80.97 | 83.13 | 82.94 | 83.03 |
| emotion2vec | 51.88 | 28.03 | 48.70 | 82.43 | 82.86 | 82.39 | 84.38 | 82.30 | 84.45 |

Table 4: emotion2vec performance on datasets of other languages.
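
| Model | AESDD (Gr) WA(%) ↑ | AESDD (Gr) UA(%) ↑ | AESDD (Gr) WF1(%) ↑ | CaFE (Fr) WA(%) ↑ | CaFE (Fr) UA(%) ↑ | CaFE (Fr) WF1(%) ↑ | RESD (Ru) WA(%) ↑ | RESD (Ru) UA(%) ↑ | RESD (Ru) WF1(%) ↑ |
|---|---|---|---|---|---|---|---|---|---|
| WavLM-base | 55.33 | 55.50 | 54.86 | 31.61 | 32.02 | 30.88 | 56.17 | 56.17 | 55.69 |
| WavLM-base+ | 53.83 | 54.41 | 52.48 | 31.40 | 33.39 | 30.40 | 55.00 | 55.19 | 55.08 |
| data2vec | 56.67 | 57.26 | 56.57 | 57.10 | 57.68 | 57.36 | 49.42 | 49.77 | 48.97 |
| data2vec 2.0 | 71.33 | 70.20 | 70.93 | 71.51 | 72.98 | 71.50 | 64.08 | 64.33 | 64.17 |
| emotion2vec | 72.33 | 72.27 | 71.57 | 74.52 | 75.26 | 74.53 | 64.75 | 65.04 | 64.53 |

| Model | EmoDB (De) WA(%) ↑ | EmoDB (De) UA(%) ↑ | EmoDB (De) WF1(%) ↑ | EMOVO (It) WA(%) ↑ | EMOVO (It) UA(%) ↑ | EMOVO (It) WF1(%) ↑ | M3ED (Zh) WA(%) ↑ | M3ED (Zh) UA(%) ↑ | M3ED (Zh) WF1(%) ↑ |
|---|---|---|---|---|---|---|---|---|---|
| WavLM-base | 59.06 | 55.32 | 58.96 | 40.17 | 40.34 | 37.36 | 44.03 | 18.90 | 34.50 |
| WavLM-base+ | 65.66 | 64.60 | 64.83 | 40.34 | 41.98 | 40.11 | 45.09 | 20.18 | 36.49 |
| data2vec | 67.17 | 64.81 | 66.52 | 51.21 | 51.97 | 49.82 | 44.44 | 21.10 | 37.77 |
| data2vec 2.0 | 83.77 | 83.07 | 83.93 | 60.69 | 61.27 | 60.79 | 47.50 | 24.12 | 41.74 |
| emotion2vec | 84.34 | 84.85 | 84.32 | 61.21 | 62.97 | 60.89 | 49.15 | 26.98 | 44.38 |

| Model | SUBESCO (Bn) WA(%) ↑ | SUBESCO (Bn) UA(%) ↑ | SUBESCO (Bn) WF1(%) ↑ | ShEMO (Fa) WA(%) ↑ | ShEMO (Fa) UA(%) ↑ | ShEMO (Fa) WF1(%) ↑ | URDU (Ur) WA(%) ↑ | URDU (Ur) UA(%) ↑ | URDU (Ur) WF1(%) ↑ |
|---|---|---|---|---|---|---|---|---|---|
| WavLM-base | 54.50 | 54.77 | 53.96 | 67.27 | 46.60 | 65.63 | 71.00 | 70.25 | 70.82 |
| WavLM-base+ | 54.73 | 54.69 | 54.59 | 66.73 | 44.29 | 65.12 | 67.25 | 68.68 | 67.47 |
| data2vec | 78.29 | 78.25 | 78.21 | 70.80 | 53.96 | 69.84 | 71.75 | 72.67 | 71.83 |
| data2vec 2.0 | 87.91 | 87.95 | 87.90 | 77.90 | 62.03 | 76.96 | 77.50 | 78.42 | 77.12 |
| emotion2vec | 90.91 | 90.96 | 90.91 | 79.97 | 66.04 | 79.56 | 81.50 | 81.87 | 81.60 |
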
Table 5: emotion2vec performance of the song emotion recognition task on the RAVDESS-Song dataset.

| Model | Upstream | Downstream | WA(%) ↑ | UA(%) ↑ | WF1(%) ↑ |
|---|---|---|---|---|---|
| Self-supervised Model | | | | | |
| WavLM-base | Freeze | Linear | 52.3 | 52.4 | 52.1 |
| WavLM-base+ | Freeze | Linear | 54.9 | 53.9 | 54.2 |
| data2vec | Freeze | Linear | 63.8 | 64.1 | 63.4 |
| data2vec 2.0 | Freeze | Linear | 73.0 | 74.6 | 72.7 |
| L³-Net (Koh and Dubnov, 2021) | Freeze | Linear | 71.0 | - | - |
| SpecMAE (Sadok et al., 2023) | Finetune | Linear | 54.5 | - | 53.9 |
| VQ-MAE-S (Patch-tf) (Sadok et al., 2023) | Finetune | Linear | 84.0 | - | 84.0 |
| VQ-MAE-S (Frame) (Sadok et al., 2023) | Finetune | Linear | 84.2 | - | 84.3 |
| emotion2vec | Freeze | Linear | 85.0 | 85.2 | 84.8 |
| Specialist Model | | | | | |
| VQ-MAE-S (Patch-tf) (Sadok et al., 2023) | Finetune | Query2Emo | 83.7 | - | 83.4 |
| VQ-MAE-S (Frame) (Sadok et al., 2023) | Finetune | Query2Emo | 85.8 | - | 85.7 |