emotion2vec: Self-Supervised Pre-Training
for Speech Emotion Representation

Ziyang Ma¹, Zhisheng Zheng¹, Jiaxin Ye², **chao Li³,
Zhifu Gao⁴, Shiliang Zhang⁴, Xie Chen¹ (corresponding author)

¹ Shanghai Jiao Tong University, ² Fudan University, ³ The Chinese University of Hong Kong, ⁴ Alibaba

Abstract

We propose emotion2vec, a universal speech emotion representation model. emotion2vec is pre-trained on open-source unlabeled emotion data through self-supervised online distillation, combining utterance-level loss and frame-level loss during pre-training. emotion2vec outperforms state-of-the-art pre-trained universal models and emotion specialist models by only training linear layers for the speech emotion recognition task on the mainstream IEMOCAP dataset. In addition, emotion2vec shows consistent improvements across speech emotion recognition datasets in 10 different languages. emotion2vec also shows excellent results on other emotion tasks, such as song emotion recognition, emotion prediction in conversation, and sentiment analysis. Comparison experiments, ablation experiments, and visualization comprehensively demonstrate the universal capability of the proposed emotion2vec. To the best of our knowledge, emotion2vec is the first universal representation model in various emotion-related tasks, filling a gap in the field. Code, checkpoints, and extracted features are available at https://github.com/ddlBoJack/emotion2vec


1 Introduction

Extracting emotional representation from speech is an essential step of various emotional tasks such as speech emotion recognition (SER) and sentiment analysis. Traditional methods employ Filter Banks (FBanks) or Mel Frequency Cepstrum Coefficients (MFCCs) as speech features. These features are not rich in semantic information, resulting in limited performance on emotional tasks. Popular methods utilize features extracted from speech-based self-supervised learning (SSL) pre-trained models, leading to a significant performance improvement.

One potential challenge blocking further performance improvement is that these SSL models are not entirely suitable for emotional tasks. Wang et al. (2021) explore no fine-tuning, partial fine-tuning, and entire fine-tuning with some SSL models for SER on the IEMOCAP dataset Busso et al. (2008), and give some empirical conclusions. This remains an ad-hoc solution: on the one hand, fine-tuning SSL models incurs a large computational cost; on the other hand, the conclusions may be data-specific or model-constrained. Recently, Chen et al. (2023a) proposed an SER model named Vesper, which is obtained by model distillation from WavLM-large Chen et al. (2022) with emotion data. Vesper is designed for the SER task, and its universal representation capability still needs to be demonstrated. Accordingly, a universal speech-based emotion representation model is urgently needed in the field.

Here we propose emotion2vec, a universal emotion representation model that can be used to extract speech features for diverse emotion tasks. Self-supervised pre-training is performed on 262 hours of open-source emotion data with an online distillation paradigm to obtain emotion2vec. Considering that both global, utterance-wide information and local details convey emotion, we propose a pre-training strategy combining utterance-level loss and frame-level loss. On the mainstream IEMOCAP dataset, the downstream linear model trained with features extracted from emotion2vec outperforms all the mainstream SSL models and the latest specialist models. emotion2vec is tested on 13 datasets covering 10 languages, and the results show that emotion2vec exhibits language generalization ability. Moreover, in addition to the SER task, we also evaluate emotion2vec features on song emotion recognition, emotion prediction in conversation, and sentiment analysis. The results indicate that emotion2vec has excellent task generalization ability. Extensive ablation experiments and visualization analysis demonstrate the effectiveness of our pre-training methods and the versatility of the proposed emotion2vec model.

2 Related Work

2.1 Speech-based SSL

Self-supervised learning has achieved remarkable success in the field of representation learning, showcasing its efficacy across natural language processing Devlin et al. (2019); Liu et al. (2019); Radford et al. (2019); Brown et al. (2020), computer vision Grill et al. (2020); He et al. (2020); Bao et al. (2021); He et al. (2022), as well as speech processing Baevski et al. (2020); Hsu et al. (2021); Chen et al. (2022); Baevski et al. (2022). For speech representation learning, SSL models can be classified into two categories according to the self-supervised targets utilized during pre-training Ma et al. (2023b): 1) offline targets and 2) online targets. Models employing offline targets require a well-trained teacher model before the pre-training stage to extract the self-supervised targets. Representative models of this type are HuBERT Hsu et al. (2021) and WavLM Chen et al. (2022), which use K-means targets, and PBERT Wang et al. (2022) and MonoBERT&PolyBERT Ma et al. (2023c), which use phoneme-based targets. Models using online targets do not need a pre-trained teacher model in advance; instead, the teacher model is continuously updated during the pre-training phase under an online distillation paradigm. Representative models of this type are data2vec Baevski et al. (2022) and data2vec 2.0 Baevski et al. (2023), which use a frame-level masked language modeling (MLM) loss, and CA-DINO Han et al. (2023), which uses an utterance-level cross-entropy loss. emotion2vec is pre-trained with a combination of utterance-level and frame-level losses, leading to a superior speech emotion representation model.

2.2 Speech Emotion Representation

We present the first universal speech emotion representation model. Most previous works either directly employ speech pre-training models Pepino et al. (2021); Li et al. (2022), or fine-tune the pre-training models on their specific emotional data with specific emotional tasks (mostly SER) Morais et al. (2022); Chen and Rudnicky (2023), to extract speech emotion representations. A series of works investigate the SER performance of wav2vec 2.0 Wang et al. (2021), HuBERT Wang et al. (2021), and WavLM Ioannides et al. (2023), with or without fine-tuning. A recent work Ma et al. (2023a) found that data2vec features also have good representation ability in the SER task. For speech emotion representation in other emotion tasks, such as multimodal emotion recognition, the popular practice Li et al. (2023a) is similar to that described above.

3 Methods

Here we mainly introduce the self-supervised pre-training method of the proposed emotion2vec, the core of which is to train the model with an utterance-level loss and a frame-level loss under an online distillation paradigm.

Figure 1: The overall framework of emotion2vec. During the pre-training phase, emotion2vec conducts online distillation with a teacher network and a student network. When a specific downstream task is performed, emotion2vec is frozen and a lightweight downstream model is trained.

3.1 Model Pipeline

As shown in Figure 1, emotion2vec contains two networks in the pre-training phase, which are the teacher network $\mathcal{T}$ and the student network $\mathcal{S}$ . Both models share the same model architecture, including a feature extractor $\mathcal{F}$ composed of multi-layer convolutional neural networks and a backbone network $\mathcal{B}$ composed of multi-layer Transformers. These modules can be configured with different architectures, which will be described in Section 4.1. Given a raw audio utterance $X=[x_{1},\cdots,x_{N_{x}}]$ , the Teacher $\mathcal{T}$ and the Student $\mathcal{S}$ respectively utilize feature extractors $\mathcal{F}^{\mathcal{T}}$ and $\mathcal{F}^{\mathcal{S}}$ to obtain the downsampled features $Z=[z_{1},\cdots,z_{N_{z}}]$ , which can be written as:

$$Z^{\mathcal{T}}=\mathcal{F}^{\mathcal{T}}(X), \tag{1}$$

$$Z^{\mathcal{S}}=\mathcal{F}^{\mathcal{S}}(X). \tag{2}$$

For the teacher network $\mathcal{T}$, the downsampled features $Z^{\mathcal{T}}$ are directly fed into the backbone network $\mathcal{B}^{\mathcal{T}}$. For the student network $\mathcal{S}$, each frame of the downsampled features $Z^{\mathcal{S}}$ is selected as a mask start with probability $p$, and the $l$ consecutive frames beginning at each selected start are masked. A learnable utterance embedding $U=[u_{1},\cdots,u_{N_{u}}]$ is then prepended to the masked sequence before it is fed into the backbone network $\mathcal{B}^{\mathcal{S}}$. This can be written as:

$$Y^{\mathcal{T}}=\frac{1}{k}\sum_{i=1}^{k}\mathcal{B}_{i}^{\mathcal{T}}(Z^{\mathcal{T}}), \tag{3}$$

$$U^{\mathcal{S}};Y^{\mathcal{S}}=\mathcal{B}^{\mathcal{S}}(U;Mask(Z^{\mathcal{S}})), \tag{4}$$

where $Y^{\mathcal{T}}$ is the average of the output embeddings of the top $k$ Transformer blocks $\mathcal{B}_{i}^{\mathcal{T}}$. The utterance-level output embedding $U^{\mathcal{S}}$ and the frame-level output embedding $Y^{\mathcal{S}}$ are the outputs of the student backbone network $\mathcal{B}^{\mathcal{S}}$, and $Mask$ denotes the masking operation. $Y^{\mathcal{T}}$, $Y^{\mathcal{S}}$, and $U^{\mathcal{S}}$ share the same hidden dimension; $Y^{\mathcal{T}}$ and $Y^{\mathcal{S}}$ both have temporal length $N_{z}$, while $U^{\mathcal{S}}$ has temporal length $N_{u}$.
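The student-side masking described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `span_mask`, the `seed` argument, the choice to count the start frame as part of the span, and the clipping of spans at the end of the utterance are all assumptions made for illustration.

```python
import random


def span_mask(num_frames, p=0.5, span=5, seed=0):
    """Return a list of booleans; True marks a masked frame.

    Every frame is picked as a span start with probability `p`, and the
    `span` frames beginning at each start are masked. Spans may overlap
    and are clipped at the end of the sequence (both are assumptions).
    """
    rng = random.Random(seed)
    mask = [False] * num_frames
    for start in range(num_frames):
        if rng.random() < p:
            for i in range(start, min(start + span, num_frames)):
                mask[i] = True
    return mask


# Example: 100 frames correspond to 2 seconds of audio at 50 Hz.
mask = span_mask(num_frames=100, p=0.5, span=5)
masked_indices = [i for i, m in enumerate(mask) if m]
```

Because spans overlap, the fraction of masked frames is much larger than $p$ alone would suggest, which is why the loss on masked frames covers most of the utterance for large $p$.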

3.2 Utterance-level Loss

Utterance-level loss constructs an utterance-level pretext task to learn the global emotion. We use mean squared error (MSE) to calculate the loss, which can be written as:

$$L_{Utt}=(\bar{Y}^{\mathcal{T}}-\bar{U}^{\mathcal{S}})^{2}, \tag{5}$$

where

$$\bar{Y}^{\mathcal{T}}=\frac{1}{N_{z}}\sum_{i=1}^{N_{z}}Y_{i}^{\mathcal{T}}, \tag{6}$$

$$\bar{U}^{\mathcal{S}}=\frac{1}{N_{u}}\sum_{i=1}^{N_{u}}U_{i}^{\mathcal{S}}, \tag{7}$$

which means that utterance-level loss $L_{Utt}$ is computed by temporal pooling results of $Y^{\mathcal{T}}$ and $U^{\mathcal{S}}$ . Here we propose three ways to compute utterance-level loss, which we call token embedding, chunk embedding, and global embedding, as shown in Figure 2.

Token Embedding

Token embedding employs a single token to represent global emotion information encoded by the student network $\mathcal{S}$ . More explicitly, we set $N_{u}$ to 1 in the learnable utterance embedding $U=[u_{1},\cdots,u_{N_{u}}]$ .

Chunk Embedding

Chunk embedding employs multiple tokens to represent global emotion information. In this case, more global information can be aggregated within the chunk.

Global Embedding

In the case of utilizing global embedding, no additional utterance tokens are added. We use temporal pooling of frame-level output embedding $Y^{\mathcal{S}}$ instead of $U^{\mathcal{S}}$ to compute the loss.
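To make the three variants concrete, the following Python sketch evaluates Eqs. (5)-(7) on toy vectors. It is only an illustration under stated assumptions: the helper names `mean_pool` and `utterance_loss` are hypothetical, and averaging the squared error over the hidden dimension is assumed as the usual MSE reduction.

```python
def mean_pool(vectors):
    """Average a list of equal-length vectors over the time axis (Eqs. 6-7)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def utterance_loss(teacher_frames, student_tokens):
    """Eq. (5): squared error between the pooled teacher frames and the
    pooled student tokens, averaged over the hidden dimension (assumed)."""
    y_bar = mean_pool(teacher_frames)
    u_bar = mean_pool(student_tokens)
    return sum((y - u) ** 2 for y, u in zip(y_bar, u_bar)) / len(y_bar)


# Toy teacher output Y^T with N_z = 2 frames and hidden size 2.
teacher = [[1.0, 2.0], [3.0, 4.0]]

# Token embedding: a single utterance token (N_u = 1).
loss_token = utterance_loss(teacher, [[2.0, 1.0]])

# Chunk embedding: several utterance tokens (here N_u = 2).
loss_chunk = utterance_loss(teacher, [[1.0, 3.0], [3.0, 3.0]])

# Global embedding: pool the student frame outputs Y^S instead of U^S.
student_frames = [[0.0, 2.0], [2.0, 4.0]]
loss_global = utterance_loss(teacher, student_frames)
```

The same function covers all three cases because they differ only in which student outputs are pooled.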

3.3 Frame-level Loss

Frame-level loss constructs a frame-wise pretext task to learn contextual emotion. We compute the loss only on the masked frames, which is common practice for a masked language modeling (MLM) pretext task. The frame-level loss $L_{Frm}$ can be expressed as:

$$L_{Frm}=\frac{1}{M}\sum_{i\in\mathbb{M}}(Y_{i}^{\mathcal{T}}-Y_{i}^{\mathcal{S}})^{2}, \tag{8}$$

where $\mathbb{M}$ denotes the set of indices of the masked frames in the frame-level output embedding $Y^{\mathcal{S}}$, and $M$ denotes the total number of masked frames.
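A small Python sketch of Eq. (8) on toy data is given below. The function name `frame_loss` is hypothetical, and averaging the squared error over the hidden dimension within each frame is an assumption (the usual MSE reduction); only the restriction to masked indices is taken from the text.

```python
def frame_loss(teacher_frames, student_frames, masked_indices):
    """Eq. (8): mean squared error computed over masked frames only."""
    total = 0.0
    for i in masked_indices:
        y_t, y_s = teacher_frames[i], student_frames[i]
        # Per-frame squared error, averaged over the hidden dimension.
        total += sum((a - b) ** 2 for a, b in zip(y_t, y_s)) / len(y_t)
    return total / len(masked_indices)


teacher = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
student = [[9.0, 9.0], [0.0, 0.0], [3.0, 5.0]]

# Frames 1 and 2 are masked; frame 0 is visible and does not contribute,
# even though the student output there is far from the teacher output.
loss = frame_loss(teacher, student, masked_indices=[1, 2])
```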

3.4 Online Distillation

Online distillation is a self-supervised learning strategy for teacher-student learning, where the student network updates its parameters by backpropagation and the teacher network updates its parameters with an exponential moving average (EMA) (Grill et al., 2020). For the student network $\mathcal{S}$, the total loss $L$ for backpropagation is a combination of the frame-level loss $L_{Frm}$ and the utterance-level loss $L_{Utt}$, denoted as:

$$L=L_{Frm}+\alpha L_{Utt}, \tag{9}$$

with a tunable weight $\alpha$. For the teacher network $\mathcal{T}$, the parameters $\theta_{0}^{\mathcal{T}}$ are initialized to the parameters of the student network $\theta_{0}^{\mathcal{S}}$, and are then updated with EMA within each mini-batch, denoted as:

$$\theta_{t+1}^{\mathcal{T}}=\tau\theta_{t}^{\mathcal{T}}+(1-\tau)\theta_{t+1}^{\mathcal{S}}, \tag{10}$$

where $\tau$ is a parameter that increases linearly during pre-training. In practice, within each mini-batch the parameters of teacher feature extractor $\mathcal{F}^{\mathcal{T}}$ are copied directly from $\mathcal{F}^{\mathcal{S}}$ , while the parameters of teacher backbone network $\mathcal{B}^{\mathcal{T}}$ are updated with EMA from $\mathcal{B}^{\mathcal{T}}$ and $\mathcal{B}^{\mathcal{S}}$ .
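The update rules in Eqs. (9) and (10) can be sketched in a few lines of Python. This is a simplified illustration and not the released training code: parameters are plain lists of floats, the helper names are hypothetical, and the number of steps over which $\tau$ is annealed is an assumption. The default endpoints of $\tau$ are the values reported in Section 4.2.

```python
def tau_at(step, total_steps, tau_start=0.999, tau_end=0.99999):
    """Linearly increase tau from tau_start to tau_end over total_steps."""
    fraction = min(step / total_steps, 1.0)
    return tau_start + (tau_end - tau_start) * fraction


def ema_update(teacher, student, tau):
    """Eq. (10): theta_T <- tau * theta_T + (1 - tau) * theta_S.

    `teacher` and `student` map parameter names to lists of floats.
    """
    for name, student_values in student.items():
        teacher[name] = [
            tau * t + (1.0 - tau) * s
            for t, s in zip(teacher[name], student_values)
        ]
    return teacher


def total_loss(l_frm, l_utt, alpha=1.0):
    """Eq. (9): frame-level loss plus weighted utterance-level loss."""
    return l_frm + alpha * l_utt


# The teacher starts as a copy of the student, then trails it via EMA.
student = {"backbone.weight": [0.0, 2.0]}
teacher = {"backbone.weight": [1.0, 1.0]}
teacher = ema_update(teacher, student, tau=0.5)
```

With $\tau$ close to 1, the teacher changes slowly, so its targets stay stable while the student is trained.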

4 Experimental Setup

4.1 Initial Model

Different initial models lead to different architectures of the feature extractor $\mathcal{F}$, the backbone network $\mathcal{B}$, and the initialization parameters $\theta_{0}$. Here we adopt two models, data2vec (checkpoint: https://dl.fbaipublicfiles.com/fairseq/data2vec/audio_base_ls.pt) and data2vec 2.0 (checkpoint: https://dl.fbaipublicfiles.com/fairseq/data2vec2/base_libri.pt), both of which have the same feature extractor design but different backbone network designs. The feature extractor $\mathcal{F}$ is a 7-layer 1-D convolutional neural network with kernel sizes $(10,3,3,3,3,2,2)$ and strides $(5,2,2,2,2,2,2)$, resulting in 320x downsampling (the product of the strides). Given the raw audio input $X$ at a 16000 Hz sample rate, the output representations $Z$ are at 50 Hz with dimension 512. A linear projection from dimension 512 to 768 is then applied, followed by the mask operation to construct the input for the backbone network $\mathcal{B}$. Here we briefly introduce the different backbone networks in data2vec and data2vec 2.0.

data2vec

The backbone network $\mathcal{B}$ contains a 5-layer learnable convolutional positional encoding followed by a 12-layer standard Transformer. Each Transformer block has a model dimension of 768, a bottleneck dimension of 3072, and 12 attention heads. Finally, a linear projection from 768 to 768 is applied to the student outputs, and the results are used to calculate the MLM loss against the teacher outputs.

data2vec 2.0

The data2vec 2.0 model shares the same Transformer architecture as data2vec, with an additional CNN decoder. To improve efficiency, the Transformer encoder only encodes the non-masked parts of the downsampled features $Z$ in an MAE-style fashion; the masked parts are then filled with random Gaussian noise before being passed to the CNN decoder. The CNN decoder is a 4-layer 1-D convolutional neural network with all kernel sizes set to 7, strides set to 1, and channels set to 384, without downsampling. A linear projection from 384 to 768 is applied to compute the MLM loss, which works the same way as in data2vec.
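Returning to the shared feature extractor described at the start of this section, the 320x downsampling and the 50 Hz output rate can be checked with a short Python calculation. The sketch assumes kernel widths $(10,3,3,3,3,2,2)$ and strides $(5,2,2,2,2,2,2)$, which is the configuration under which the stride product equals 320, and it assumes unpadded convolutions; the function name `num_output_frames` is hypothetical.

```python
from math import prod

KERNELS = (10, 3, 3, 3, 3, 2, 2)  # assumed kernel widths, one per layer
STRIDES = (5, 2, 2, 2, 2, 2, 2)   # assumed strides, one per layer
SAMPLE_RATE = 16000               # Hz, as stated in the text


def num_output_frames(num_samples):
    """Number of frames after the 7 conv layers, assuming no padding."""
    n = num_samples
    for kernel, stride in zip(KERNELS, STRIDES):
        n = (n - kernel) // stride + 1
    return n


downsampling = prod(STRIDES)              # 5 * 2**6 = 320
frame_rate = SAMPLE_RATE / downsampling   # 50.0 Hz, i.e. 20 ms per frame
frames_per_second = num_output_frames(SAMPLE_RATE)
```

Under these assumptions one second of audio yields 49 frames rather than exactly 50, because the unpadded kernels consume a few samples at the edges; the nominal rate is still 50 Hz.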

4.2 Training Details

Self-supervised Pre-training

In the pre-training phase, we train emotion2vec with $262$ hours of unlabeled emotion data shown in Table 1 with different initial models. Pre-training is conducted on $4$ NVIDIA A10 Tensor Core GPUs, and we simulate $16$ GPUs by setting the update frequency to $4$. We train emotion2vec for $100$ epochs, each of which takes about $37$ minutes. We use a dynamic batch size, where the maximum number of tokens is $1\times 10^{6}$. For optimization, we use Adam with a learning rate of $7.5\times 10^{-5}$ and a weight decay of $1\times 10^{-2}$, together with a cosine learning rate scheduler whose first $5\%$ of updates are a linear warm-up. For the student model, each time step of the input has a probability of $p=0.5$ of being a start index, and the subsequent $l=5$ time steps are masked. The hyperparameter $\alpha$ that controls the loss weight is set to $1$. For the teacher model, we use the average of the outputs of the top $k=8$ Transformer blocks as the training targets. We linearly increase $\tau$ from $\tau_{s}=0.999$ to $\tau_{e}=0.99999$ for the exponential moving average of the teacher parameters.
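As a rough illustration of the schedule and training budget described above, the following Python sketch implements a linear warm-up followed by cosine decay and works out two derived quantities. The function name `learning_rate`, the decay floor of zero, and the exact shape of the warm-up are assumptions; the peak learning rate, warm-up proportion, GPU count, update frequency, and epoch time are taken from the text.

```python
import math


def learning_rate(step, total_steps, peak_lr=7.5e-5, warmup_frac=0.05, min_lr=0.0):
    """Linear warm-up to peak_lr, then cosine decay to min_lr (assumed 0)."""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Gradient accumulation: 4 physical GPUs x update frequency 4.
effective_gpus = 4 * 4

# Approximate wall-clock budget: 100 epochs at about 37 minutes each.
total_hours = 100 * 37 / 60
```

Under these figures, the reported setup corresponds to roughly 62 hours of pre-training on the 4 GPUs.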

Supervised Fine-tuning

All model architectures of diverse downstream tasks are designed to be as simple as possible, to demonstrate the representation ability of the pretrained model. For the non-sequential task, following the common practice of SUPERB Yang et al. (2021), we use two linear layers with a ReLU activation function sandwiched between them. For the sequential task, we use two layers of gated recurrent units (GRU) to make predictions.
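For the non-sequential case, the size of such a head can be checked with simple arithmetic. The sketch below assumes a 768-dimensional input (Section 4.1), a hidden dimension of 256 (Section 5.2), and 4 output classes (the IEMOCAP setup in Section 4.3); mean pooling over time before the first linear layer is an assumption, and the helper names are hypothetical.

```python
def linear_params(d_in, d_out):
    """Weights plus biases of one fully connected layer."""
    return d_in * d_out + d_out


def linear(x, weight, bias):
    """Apply y = W x + b, with W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def head_forward(frames, w1, b1, w2, b2):
    """Mean-pool frames (assumed), then Linear -> ReLU -> Linear."""
    pooled = [sum(column) / len(frames) for column in zip(*frames)]
    hidden = [max(0.0, h) for h in linear(pooled, w1, b1)]
    return linear(hidden, w2, b2)


feature_dim, hidden_dim, num_classes = 768, 256, 4
n_params = linear_params(feature_dim, hidden_dim) + linear_params(hidden_dim, num_classes)

# Tiny forward pass with 2-dimensional features to show the data flow.
logits = head_forward(
    frames=[[1.0, 3.0], [3.0, 5.0]],
    w1=[[1.0, 0.0], [0.0, -1.0]], b1=[0.0, 0.0],
    w2=[[1.0, 1.0]], b2=[0.5],
)
```

The count comes to 197,892 parameters, which is consistent with the 0.20M downstream size listed for base models in Table 2.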

4.3 Datasets

Table 1: The datasets at a glance for emotion2vec pre-training and downstream tasks.

Dataset	Pretrain	Downstream	Source	Emo	Spk	Lang	#Utts	#Hours
IEMOCAP Busso et al. (2008)	✓	✓	Act	5	10	English	5531	7.0
MELD Poria et al. (2019)	✓	✓	Friends TV	7	407	English	13847	12.2
CMU-MOSEI Zadeh et al. (2018)	✓	✓	YouTube	7	1000	English	44977	91.9
MEAD Wang et al. (2020)	✓	✗	Act	8	60	English	31792	37.3
MSP-Podcast (V1.8) Martinez-Lucas et al. (2020)	✓	✗	Podcast	8	10000+	English	72969	113.5
Total	✓	–	–	–	–	English	169053	262.0
CMU-MOSI Zadeh et al. (2016)	✗	✓	YouTube	7	89	English	2199	2.6
RAVDESS-Speech Livingstone and Russo (2018)	✗	✓	Act	8	24	English	1440	1.5
RAVDESS-Song Livingstone and Russo (2018)	✗	✓	Act	8	23	English	1012	1.3
SAVEE Jackson and Haq (2014)	✗	✓	Act	7	4	English	480	0.5
M3ED Zhao et al. (2022)	✗	✓	TVs	7	626	Mandarin	24449	9.8
EmoDB Burkhardt et al. (2005)	✗	✓	Act	7	10	German	535	0.4
EMOVO Costantini et al. (2014)	✗	✓	Act	7	10	Italian	588	0.5
CaFE Gournay et al. (2018)	✗	✓	Act	7	12	French	936	1.2
SUBESCO Sultana et al. (2021)	✗	✓	Act	7	20	Bangla	7000	7.8
ShEMO Mohamad Nezami et al. (2019)	✗	✓	Act	6	87	Persian	3000	3.4
URDU Latif et al. (2018)	✗	✓	Talk shows	4	38	Urdu	400	0.3
AESDD Vryzas et al. (2018)	✗	✓	Act	5	5	Greek	604	0.7
RESD Lubenets et al.	✗	✓	Act	7	200	Russian	1396	2.3

A summary of the datasets employed in our experiments is presented in Table 1. There are 18 emotional datasets covering 10 different languages: 9 datasets in English, and 1 each in Mandarin, Bangla, French, German, Greek, Italian, Persian, Russian, and Urdu. Each dataset is characterized by Pretrain (i.e., whether it is used during the pre-training phase), Downstream (i.e., whether it is tested in the downstream tasks), Source (i.e., where the samples were collected), Emo (i.e., number of emotion categories), Spk (i.e., number of speakers), Lang (i.e., language), #Utts (i.e., number of utterances), and #Hours (i.e., total duration of the samples). Speech data is extracted from these datasets and uniformly converted to single-channel audio at 16 kHz.

In the pretraining phase, we utilize five large-scale English datasets, including IEMOCAP Busso et al. (2008), MELD Poria et al. (2019), MEAD Wang et al. (2020), CMU-MOSEI Zadeh et al. (2018), and MSP-Podcast Martinez-Lucas et al. (2020), resulting in a total of 262 hours. The IEMOCAP corpus contains a total of 5 sessions and 10 different speakers, with each session being a conversation of two exclusive speakers. MELD is a multi-party conversational dataset containing about 13,847 utterances from 1,433 dialogues collected from the TV series ‘Friends’. MEAD is a talking-face video corpus featuring 60 actors and actresses talking with 8 different emotions at three different intensity levels. CMU-MOSEI is a multimodal dataset from YouTube for sentiment and emotion analysis in videos. MSP-Podcast is collected from podcast recordings that discuss a variety of topics like politics, sports, and movies.

Different datasets are used to test different downstream tasks with various languages. For the main results in Section 5.2, we report cross-validation (CV) results on the IEMOCAP dataset. The original labels cover five classes. To be consistent and comparable with previous methods Ye et al. (2023); Chen et al. (2023b), we merge ‘excited’ with ‘happy’ to better balance the size of each emotion class, resulting in four classes. We conduct both leave-one-session-out 5-fold CV and leave-one-speaker-out 10-fold CV. Moreover, we report results on MELD under its original split setup, and on the RAVDESS Livingstone and Russo (2018) and SAVEE Jackson and Haq (2014) datasets under a random leave-one-out 10-fold CV setup, meaning that in each fold all samples in the dataset are randomly split into training, validation, and testing sets of 80%, 10%, and 10%, respectively. Speech in the RAVDESS and SAVEE datasets is not seen in the pre-training stage, so these results demonstrate the generalization of the proposed model on out-of-domain corpora.
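The two evaluation protocols above can be sketched as follows. This is an illustrative Python outline rather than the authors' data pipeline: the function names, the use of utterance IDs mapped to session numbers, and the per-fold random seed are assumptions.

```python
import random


def leave_one_session_out(session_of):
    """Yield (held_out_session, train_ids, test_ids) for each session.

    `session_of` maps an utterance ID to its session (e.g. 1..5 for IEMOCAP).
    """
    for held_out in sorted(set(session_of.values())):
        test_ids = [u for u, s in session_of.items() if s == held_out]
        train_ids = [u for u, s in session_of.items() if s != held_out]
        yield held_out, train_ids, test_ids


def random_split(utt_ids, seed, ratios=(0.8, 0.1, 0.1)):
    """Randomly split IDs into train/valid/test sets (80/10/10 by default)."""
    ids = list(utt_ids)
    random.Random(seed).shuffle(ids)
    n_train = round(len(ids) * ratios[0])
    n_valid = round(len(ids) * ratios[1])
    return ids[:n_train], ids[n_train:n_train + n_valid], ids[n_train + n_valid:]


# Toy corpus: 20 utterances spread evenly over 5 sessions.
session_of = {f"utt{i}": i % 5 + 1 for i in range(20)}
folds = list(leave_one_session_out(session_of))

# Ten independent random 80/10/10 splits, one per fold (seed = fold index).
splits = [random_split(range(100), seed=fold) for fold in range(10)]
```

In the session-based protocol no speaker appears in both the training and the test set of a fold, whereas the random protocol makes no such guarantee.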

For the language generalization task in Section 5.3, we report CV results on 9 out-of-domain datasets, one each in Mandarin (M3ED Zhao et al. (2022)), Bangla (SUBESCO Sultana et al. (2021)), French (CaFE Gournay et al. (2018)), German (EmoDB Burkhardt et al. (2005)), Greek (AESDD Vryzas et al. (2018)), Italian (EMOVO Costantini et al. (2014)), Persian (ShEMO Mohamad Nezami et al. (2019)), Russian (RESD Lubenets et al.), and Urdu (URDU Latif et al. (2018)). Unless the dataset provides its own partition, language generalization results are obtained using the random leave-one-out 10-fold CV described above. For example, for the RESD dataset we follow its original split with 280 testing samples and 1116 training samples, and additionally hold out 10% of the training samples for validation.

For the task generalization experiments in Section 5.4, we test other speech emotion tasks, including song emotion recognition, emotion prediction in conversation, and sentiment analysis, on RAVDESS-Song Livingstone and Russo (2018), IEMOCAP, and CMU-MOSI Zadeh et al. (2016) & CMU-MOSEI Zadeh et al. (2018), respectively. For song emotion recognition and emotion prediction in conversation, we report CV results. For sentiment analysis, we report results with the original split setup. To be comparable with previous work, the experimental setup varies according to the specific task.

5 Results

5.1 Evaluation Metrics

We apply commonly used evaluation metrics, weighted accuracy (WA), unweighted accuracy (UA), and weighted average F1 (WF1), to evaluate the performance of speech emotion tasks. WA corresponds to the overall accuracy and UA corresponds to the average class-wise accuracy. WF1 is a comprehensive evaluation, especially for the situation of sample imbalance.
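The three metrics can be written out explicitly. The Python sketch below follows the definitions given above (WA as overall accuracy, UA as the mean of per-class accuracies, WF1 as the support-weighted mean of per-class F1); the function names are hypothetical and the toy labels are invented purely for illustration.

```python
from collections import Counter


def weighted_accuracy(y_true, y_pred):
    """WA: fraction of all samples that are classified correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def unweighted_accuracy(y_true, y_pred):
    """UA: mean of the per-class accuracies (recalls)."""
    recalls = []
    for c in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == c]
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(recalls) / len(recalls)


def weighted_f1(y_true, y_pred):
    """WF1: per-class F1 averaged with class support as weights."""
    total = 0.0
    for c, support in Counter(y_true).items():
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = support - tp
        denom = 2 * tp + fp + fn
        total += support * (2 * tp / denom if denom else 0.0)
    return total / len(y_true)


# Imbalanced toy example: a model that always predicts the majority class.
y_true = ["neutral", "neutral", "neutral", "happy"]
y_pred = ["neutral", "neutral", "neutral", "neutral"]
wa = weighted_accuracy(y_true, y_pred)
ua = unweighted_accuracy(y_true, y_pred)
wf1 = weighted_f1(y_true, y_pred)
```

In this example WA is 0.75 while UA is only 0.5, which illustrates why WA and UA can diverge sharply on imbalanced datasets.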

5.2 Main Results

The results are shown in Table 2, where we compare different SSL pre-trained models on the IEMOCAP dataset, as well as larger-scale pre-trained models and the latest specialist models designed for SER tasks. We follow the evaluation of SUPERB Yang et al. (2021), freezing the pre-trained model and training downstream linear layers with the hidden dimension set to 256. As can be seen from the table, emotion2vec outperforms all existing SSL pre-trained models, including base models with a similar number of parameters and large models with more parameters. Compared with Vesper-12, an SER model obtained by distillation from WavLM-large, emotion2vec works better with fewer parameters. TIM-Net Ye et al. (2023), MSTR Li et al. (2023b), and DST Chen et al. (2023b) are the latest SER specialist models, which use different scales of upstream features and downstream networks. The proposed emotion2vec model outperforms or performs on par with these models using only linear layers, while their downstream networks have 2x, 135x, and 114x as many parameters as that of emotion2vec, respectively. We provide the results of leave-one-session-out five-fold cross-validation and leave-one-speaker-out ten-fold cross-validation for reference.

We also conduct experiments on other mainstream English datasets to demonstrate the generalization of emotion2vec, as shown in Table 3. MELD is a noisy dataset used to test the SER performance of the model in complex environments. RAVDESS and SAVEE are out-of-domain datasets, each with its own recording environment. Experimental results show that emotion2vec exhibits state-of-the-art performance on different datasets in different environments.

Table 2: SER task performance of different SSL pre-trained models on the IEMOCAP dataset. The setting of the downstream models follows SUPERB Yang et al. (2021) and uses linear layers to test the representation ability of different upstream models. “LS-960” means LibriSpeech 960 hours, “LL-60k” means LibriLight 60k hours, and “Mix-94k” means 94k hours of data including LibriLight, VoxPopuli, and GigaSpeech. For emotion data, “LSED-206” means LSED 206 hours, and “Emo-262” refers to the 262 hours of pre-training data in Table 1. Models are tested using leave-one-session-out five-fold cross-validation with 20% of the training set used as the validation set for each session. Models with underline use leave-one-speaker-out ten-fold cross-validation with 8 speakers for training, 1 speaker for validation, and 1 speaker for testing within each fold. Models with * use the same fold for both validation and testing, for a fair comparison, as some works follow this principle. We also compare with larger-scale pre-trained models and the latest specialist models designed for SER tasks.

Self-supervised Model
Model	Pre-training Corpus	Upstream	#Upstream Params	Downstream	#Downstream Params	WA(%) $\uparrow$
small size
wav2vec (Schneider et al., 2019)	LS-960	Proposed	32.54M	Linear	0.13M	59.79
vq-wav2vec (Baevski et al., 2019)	LS-960	Proposed	34.15M	Linear	0.20M	58.24
base size
wav2vec 2.0 (Baevski et al., 2020)	LS-960	Proposed	95.04M	Linear	0.20M	63.43
HuBERT (Hsu et al., 2021)	LS-960		94.68M		0.20M	64.92
WavLM (Chen et al., 2022)	LS-960		94.70M		0.20M	65.94
WavLM+ (Chen et al., 2022)	Mix-94k		94.70M		0.20M	67.98
data2vec (Baevski et al., 2022)	LS-960		93.75M		0.20M	67.38
data2vec 2.0 (Baevski et al., 2023)	LS-960		93.78M		0.20M	68.58
Vesper-4 (Chen et al., 2023a)	Mix-94k + LSED-206		63.52M		0.26M	68.40
Vesper-12 (Chen et al., 2023a)	Mix-94k + LSED-206		164.29M		0.26M	70.70
emotion2vec	LS-960 + Emo-262		93.79M		0.20M	71.79
emotion2vec*	LS-960 + Emo-262		93.79M		0.20M	74.48
emotion2vec	LS-960 + Emo-262		93.79M		0.20M	72.94
emotion2vec*	LS-960 + Emo-262		93.79M		0.20M	77.64
large size
wav2vec 2.0 (Baevski et al., 2020)	LL-60k	Proposed	317.38M	Linear	0.26M	65.64
HuBERT (Hsu et al., 2021)	LL-60k		316.61M			67.62
WavLM (Chen et al., 2022)	Mix-94k		316.62M			70.03
Supervised Model
TIM-Net (Ye et al., 2023)	-	MFCC	-	CNN(TIM-Net)	0.40M	68.29
MSTR (Li et al., 2023b)		HuBERT-large	316.61M	Transformer(MSTR)	27.00M	70.03
DST (Chen et al., 2023b)		WavLM-large	316.62M	Transformer(DST)	22.78M	71.80

Table 3: emotion2vec performance on mainstream English datasets.

Model	WA(%) $\uparrow$	UA(%) $\uparrow$	WF1(%) $\uparrow$	WA(%) $\uparrow$	UA(%) $\uparrow$	WF1(%) $\uparrow$	WA(%) $\uparrow$	UA(%) $\uparrow$	WF1(%) $\uparrow$
Model	MELD			RAVDESS			SAVEE
WavLM-base	46.95	16.34	35.16	37.01	37.11	36.08	42.08	38.46	38.93
WavLM-base+	43.78	16.75	34.60	38.89	38.40	37.75	43.54	39.27	42.19
data2vec	45.75	24.98	43.59	69.58	69.70	69.25	82.50	82.26	82.37
data2vec 2.0	48.92	26.10	45.80	81.04	80.80	80.97	83.13	82.94	83.03
emotion2vec	51.88	28.03	48.70	82.43	82.86	82.39	84.38	82.30	84.45

Table 4: emotion2vec performance on datasets of other languages.

Model	WA(%) $\uparrow$	UA(%) $\uparrow$	WF1(%) $\uparrow$	WA(%) $\uparrow$	UA(%) $\uparrow$	WF1(%) $\uparrow$	WA(%) $\uparrow$	UA(%) $\uparrow$	WF1(%) $\uparrow$
Model	AESDD (Gr)			CaFE (Fr)			RESD (Ru)
WavLM-base	55.33	55.50	54.86	31.61	32.02	30.88	56.17	56.17	55.69
WavLM-base+	53.83	54.41	52.48	31.40	33.39	30.40	55.00	55.19	55.08
data2vec	56.67	57.26	56.57	57.10	57.68	57.36	49.42	49.77	48.97
data2vec 2.0	71.33	70.20	70.93	71.51	72.98	71.50	64.08	64.33	64.17
emotion2vec	72.33	72.27	71.57	74.52	75.26	74.53	64.75	65.04	64.53
Model	EmoDB (De)			EMOVO (It)			M3ED (Zh)
WavLM-base	59.06	55.32	58.96	40.17	40.34	37.36	44.03	18.90	34.50
WavLM-base+	65.66	64.60	64.83	40.34	41.98	40.11	45.09	20.18	36.49
data2vec	67.17	64.81	66.52	51.21	51.97	49.82	44.44	21.10	37.77
data2vec 2.0	83.77	83.07	83.93	60.69	61.27	60.79	47.50	24.12	41.74
emotion2vec	84.34	84.85	84.32	61.21	62.97	60.89	49.15	26.98	44.38
Model	SUBESCO (Bn)			ShEMO (Fa)			URDU (Ur)
WavLM-base	54.50	54.77	53.96	67.27	46.60	65.63	71.00	70.25	70.82
WavLM-base+	54.73	54.69	54.59	66.73	44.29	65.12	67.25	68.68	67.47
data2vec	78.29	78.25	78.21	70.80	53.96	69.84	71.75	72.67	71.83
data2vec 2.0	87.91	87.95	87.90	77.90	62.03	76.96	77.50	78.42	77.12
emotion2vec	90.91	90.96	90.91	79.97	66.04	79.56	81.50	81.87	81.60

Table 5: emotion2vec performance of the song emotion recognition task on the RAVDESS-Song dataset.

Self-supervised Model
Model	Upstream	Downstream	WA(%) $\uparrow$	UA(%) $\uparrow$	WF1(%) $\uparrow$
WavLM-base	Freeze	Linear	52.3	52.4	52.1
WavLM-base+	Freeze		54.9	53.9	54.2
data2vec	Freeze		63.8	64.1	63.4
data2vec 2.0	Freeze		73.0	74.6	72.7
L$^{3}$-Net (Koh and Dubnov, 2021)	Freeze		71.0	-	-
SpecMAE (Sadok et al., 2023)	Finetune		54.5	-	53.9
VQ-MAE-S (Patch-tf) (Sadok et al., 2023)	Finetune		84.0	-	84.0
VQ-MAE-S (Frame) (Sadok et al., 2023)	Finetune		84.2	-	84.3
emotion2vec	Freeze		85.0	85.2	84.8
Specialist Model
VQ-MAE-S (Patch-tf) (Sadok et al., 2023)	Finetune	Query2Emo	83.7	-	83.4
VQ-MAE-S (Frame) (Sadok et al., 2023)	Finetune	Query2Emo	85.8	-	85.7
