emotion2vec: Self-Supervised Pre-Training
for Speech Emotion Representation
Abstract
We propose emotion2vec, a universal speech emotion representation model. emotion2vec is pre-trained on open-source unlabeled emotion data through self-supervised online distillation, combining utterance-level loss and frame-level loss during pre-training. By training only linear layers for the speech emotion recognition task on the mainstream IEMOCAP dataset, emotion2vec outperforms state-of-the-art pre-trained universal models and emotion specialist models. In addition, emotion2vec shows consistent improvements on speech emotion recognition datasets in 10 different languages. emotion2vec also shows excellent results on other emotion tasks, such as song emotion recognition, emotion prediction in conversation, and sentiment analysis. Comparison experiments, ablation experiments, and visualization comprehensively demonstrate the universal capability of the proposed emotion2vec. To the best of our knowledge, emotion2vec is the first universal representation model for various emotion-related tasks, filling a gap in the field. Code, checkpoints, and extracted features are available at https://github.com/ddlBoJack/emotion2vec
Ziyang Ma1, Zhisheng Zheng1, Jiaxin Ye2, **chao Li3, Zhifu Gao4, Shiliang Zhang4, Xie Chen1 (corresponding author)
1 Shanghai Jiao Tong University, 2 Fudan University, 3 The Chinese University of Hong Kong, 4 Alibaba
1 Introduction
Extracting emotional representations from speech is an essential step in various emotional tasks such as speech emotion recognition (SER) and sentiment analysis. Traditional methods employ Filter Banks (FBanks) or Mel-Frequency Cepstral Coefficients (MFCCs) as speech features. These features are not rich in semantic information, resulting in limited performance on emotional tasks. Popular methods instead use features extracted from speech-based self-supervised learning (SSL) pre-trained models, leading to a significant performance improvement.
One potential challenge blocking further performance improvement is that these SSL models are not entirely suitable for emotional tasks. Wang et al. (2021) explore no fine-tuning, partial fine-tuning, and entire fine-tuning of several SSL models for SER on the IEMOCAP dataset Busso et al. (2008), and draw some empirical conclusions. This remains an ad-hoc solution: on the one hand, fine-tuning SSL models incurs a large computational cost; on the other hand, the conclusions may be data-specific or model-constrained. Recently, Chen et al. (2023a) proposed an SER model named Vesper, which is obtained by model distillation from WavLM-large Chen et al. (2022) with emotion data. Vesper is designed for the SER task, and its universal representation capability still needs to be demonstrated. Accordingly, a universal speech-based emotion representation model is urgently needed in the field.
Here we propose emotion2vec, a universal emotion representation model that can be used to extract speech features for diverse emotion tasks. Self-supervised pre-training is performed on 262 hours of open-source emotion data with an online distillation paradigm to obtain emotion2vec. Considering that both global (utterance-wide) information and local details convey emotion, we propose a pre-training strategy combining utterance-level loss and frame-level loss. On the mainstream IEMOCAP dataset, the downstream linear model trained with features extracted from emotion2vec outperforms all the mainstream SSL models and the latest specialist models. emotion2vec is tested on 13 datasets covering 10 languages, and the results show that emotion2vec exhibits language generalization ability. Moreover, in addition to the SER task, we also evaluate emotion2vec features on song emotion recognition, emotion prediction in conversation, and sentiment analysis. The results indicate that emotion2vec has excellent task generalization ability. Extensive ablation experiments and visualization analysis demonstrate the effectiveness of our pre-training methods and the versatility of the proposed emotion2vec model.
2 Related Work
2.1 Speech-based SSL
Self-supervised learning has achieved remarkable success in the field of representation learning, showcasing its efficacy across natural language processing Devlin et al. (2019); Liu et al. (2019); Radford et al. (2019); Brown et al. (2020), computer vision Grill et al. (2020); He et al. (2020); Bao et al. (2021); He et al. (2022), as well as speech processing Baevski et al. (2020); Hsu et al. (2021); Chen et al. (2022); Baevski et al. (2022). For speech representation learning, all SSL models can be classified into two categories according to the self-supervised targets utilized during pre-training Ma et al. (2023b): 1) Offline targets. 2) Online targets. Models employing offline targets often require a well-trained teacher model before the pre-training stage to extract self-supervised targets. Representative models of this type are HuBERT Hsu et al. (2021) and WavLM Chen et al. (2022), which use K-means targets, and PBERT Wang et al. (2022) and MonoBERT&PolyBERT Ma et al. (2023c), which use phoneme-based targets. Models using online targets do not need a pre-trained teacher model in advance; instead, the teacher model is constantly updated during the pre-training phase under an online distillation paradigm. Representative models of this type are data2vec Baevski et al. (2022) and data2vec 2.0 Baevski et al. (2023), which use a frame-level masked language model (MLM) loss, and CA-DINO Han et al. (2023), which uses an utterance-level cross-entropy loss. emotion2vec is pre-trained by combining both utterance-level loss and frame-level loss, leading to a superior speech emotion representation model.
2.2 Speech Emotion Representation
We present the first universal speech emotion representation model, whereas most of the previous works directly employ speech pre-training models Pepino et al. (2021); Li et al. (2022), or fine-tune the pre-training models on their specific emotional data with specific emotional tasks (mostly SER) Morais et al. (2022); Chen and Rudnicky (2023), to extract speech emotion representation. A series of works investigate the SER performance of wav2vec 2.0 Wang et al. (2021), HuBERT Wang et al. (2021), as well as WavLM Ioannides et al. (2023), either fine-tuning or not. A recent work Ma et al. (2023a) found that data2vec features also have a good representation ability in the SER task. For speech emotion representation in other emotion tasks, such as multimodal emotion recognition, popular practice Li et al. (2023a) is similar to what is mentioned above.
3 Methods
Here we mainly introduce the self-supervised pre-training method of the proposed emotion2vec, the core of which is to train the model with an utterance-level loss and a frame-level loss under an online distillation paradigm.
![Refer to caption](x1.png)
3.1 Model Pipeline
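As shown in Figure 1, emotion2vec contains two networks in the pre-training phase: the teacher network $\mathcal{T}$ and the student network $\mathcal{S}$. Both networks share the same architecture, comprising a feature extractor $\mathcal{F}$ composed of multi-layer convolutional neural networks and a backbone network $\mathcal{B}$ composed of multi-layer Transformers. These modules can be configured with different architectures, which are described in Section 4.1. Given a raw audio utterance $X = [x_1, \dots, x_{N_x}]$, the teacher $\mathcal{T}$ and the student $\mathcal{S}$ use their respective feature extractors $\mathcal{F}^{\mathcal{T}}$ and $\mathcal{F}^{\mathcal{S}}$ to obtain the downsampled features $Z = [z_1, \dots, z_{N_z}]$, which can be written as:

$$Z = \mathcal{F}^{\mathcal{T}}(X), \tag{1}$$

$$Z = \mathcal{F}^{\mathcal{S}}(X). \tag{2}$$

For the teacher network $\mathcal{T}$, the downsampled features $Z$ are fed directly into the backbone network $\mathcal{B}^{\mathcal{T}}$. For the student network $\mathcal{S}$, spans of $l$ consecutive frames of $Z$ are masked, with each frame chosen as the start of a span with probability $p$. A learnable utterance embedding $U = [u_1, \dots, u_{N_u}]$ is then prepended before the sequence is fed into the backbone network $\mathcal{B}^{\mathcal{S}}$. This can be written as:

$$Y^{\mathcal{T}} = \frac{1}{k} \sum_{i=1}^{k} \mathcal{B}_i^{\mathcal{T}}(Z), \tag{3}$$

$$[U^{\mathcal{S}}; Y^{\mathcal{S}}] = \mathcal{B}^{\mathcal{S}}([U; \mathrm{Mask}(Z)]), \tag{4}$$

where $Y^{\mathcal{T}}$ is the average of the output embeddings of the top $k$ Transformer blocks $\mathcal{B}_i^{\mathcal{T}}$. The utterance-level output embedding $U^{\mathcal{S}}$ and the frame-level output embedding $Y^{\mathcal{S}}$ are the outputs of the student backbone network $\mathcal{B}^{\mathcal{S}}$, and $\mathrm{Mask}(\cdot)$ is the masking operation. $Y^{\mathcal{T}}$, $U^{\mathcal{S}}$, and $Y^{\mathcal{S}}$ share the same hidden dimension; $Y^{\mathcal{T}}$ and $Y^{\mathcal{S}}$ have $N_z$ temporal steps, while $U^{\mathcal{S}}$ has $N_u$.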
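The student-side input construction described above can be sketched in a few lines of plain Python. This is an illustrative sketch only, not the released implementation: the names `span_mask`, `build_student_input`, `start_prob`, `span_len`, and `mask_vec` are hypothetical, the numeric values are arbitrary, and masked frames are assumed to be replaced by a single placeholder vector.

```python
import random

def span_mask(num_frames, start_prob, span_len, rng):
    """Return a list of booleans, True where a frame is masked.

    Each frame is independently chosen as a span start with probability
    start_prob; that frame and the following frames (span_len in total,
    clipped at the end of the sequence) are then masked.
    """
    masked = [False] * num_frames
    for start in range(num_frames):
        if rng.random() < start_prob:
            for t in range(start, min(start + span_len, num_frames)):
                masked[t] = True
    return masked

def build_student_input(frames, masked, mask_vec, utt_tokens):
    """Replace masked frames with mask_vec and prepend the utterance tokens."""
    body = [mask_vec if m else f for f, m in zip(frames, masked)]
    return list(utt_tokens) + body

# Toy example: 20 frames of dimension 4 and one utterance token (assumed values).
rng = random.Random(0)
frames = [[float(t)] * 4 for t in range(20)]
masked = span_mask(len(frames), start_prob=0.2, span_len=3, rng=rng)
student_in = build_student_input(frames, masked, mask_vec=[0.0] * 4,
                                 utt_tokens=[[1.0] * 4])
print(len(student_in), sum(masked))
```

The teacher would receive `frames` unmasked, so only the student has to infer the masked content from context.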
3.2 Utterance-level Loss
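Utterance-level loss constructs an utterance-level pretext task to learn global emotion. We use mean squared error (MSE) to calculate the loss, which can be written as:

$$L_{Utt} = \left(\bar{Y}^{\mathcal{T}} - \bar{U}^{\mathcal{S}}\right)^2, \tag{5}$$

where

$$\bar{Y}^{\mathcal{T}} = \frac{1}{N_z} \sum_{i=1}^{N_z} Y_i^{\mathcal{T}}, \tag{6}$$

$$\bar{U}^{\mathcal{S}} = \frac{1}{N_u} \sum_{i=1}^{N_u} U_i^{\mathcal{S}}, \tag{7}$$

which means that the utterance-level loss is computed from the temporal pooling of $Y^{\mathcal{T}}$ and $U^{\mathcal{S}}$. Here we propose three ways to compute the utterance-level loss, which we call token embedding, chunk embedding, and global embedding, as shown in Figure 2.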
Token Embedding
Token embedding employs a single token to represent global emotion information encoded by the student network $\mathcal{S}$. More explicitly, we set the number of utterance tokens, $N_u$, to 1 in the learnable utterance embedding $U$.
Chunk Embedding
Chunk embedding employs multiple tokens to represent global emotion information. In this case, more global information can be aggregated within the chunk.
Global Embedding
When global embedding is used, no additional utterance tokens are added. We use the temporal pooling of the frame-level output embedding $Y^{\mathcal{S}}$ instead of the utterance-level output embedding $U^{\mathcal{S}}$ to compute the loss.
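The three variants differ only in which student outputs are pooled before the MSE is taken. The sketch below makes that concrete under stated assumptions: vectors are plain Python lists, the MSE is averaged over the hidden dimension, and the function and argument names (`utterance_loss`, `mode`, and so on) are hypothetical rather than taken from the released code.

```python
def mean_pool(seq):
    """Average a list of equal-length vectors over the time axis."""
    n = len(seq)
    return [sum(v[d] for v in seq) / n for d in range(len(seq[0]))]

def mse(a, b):
    """Mean squared error between two vectors of the same length."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def utterance_loss(teacher_frames, student_utt, student_frames, mode):
    """Utterance-level loss for the token, chunk, or global variant."""
    target = mean_pool(teacher_frames)      # pooled teacher output
    if mode in ("token", "chunk"):
        # token: one utterance vector; chunk: several, pooled the same way
        pred = mean_pool(student_utt)
    elif mode == "global":
        # no utterance tokens: pool the student's frame-level outputs instead
        pred = mean_pool(student_frames)
    else:
        raise ValueError(f"unknown mode: {mode}")
    return mse(target, pred)

teacher = [[1.0, 2.0], [3.0, 4.0]]          # pools to [2.0, 3.0]
student_utt = [[2.0, 3.0]]                  # a single utterance token
student_frames = [[0.0, 0.0], [2.0, 2.0]]   # pools to [1.0, 1.0]
print(utterance_loss(teacher, student_utt, student_frames, "token"))   # 0.0
print(utterance_loss(teacher, student_utt, student_frames, "global"))  # 2.5
```

Token and chunk embedding share one code path here because a single token is just a chunk of length one.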
![Refer to caption](x2.png)
3.3 Frame-level Loss
Frame-level loss constructs a frame-wise pretext task to learn contextual emotion. We compute the loss only on the masked part, which is common practice for a masked language modeling (MLM) pretext task. The frame-level loss can be expressed as:

$$L_{Frm} = \frac{1}{M} \sum_{i \in \mathbb{M}} \left(Y_i^{\mathcal{T}} - Y_i^{\mathcal{S}}\right)^2, \tag{8}$$

where $\mathbb{M}$ denotes the index sequence of the frame-level output embedding $Y^{\mathcal{S}}$ being masked, and $M$ denotes the total number of tokens being masked.
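As a minimal sketch of a loss restricted to masked positions, the function below averages the per-frame squared error over the masked frames only. The name `frame_loss`, the list-based representation, and the averaging over the hidden dimension are assumptions made for illustration.

```python
def frame_loss(teacher_frames, student_frames, masked):
    """Mean squared error over masked frames only.

    masked is a list of booleans aligned with the frame sequences;
    unmasked frames do not contribute to the loss.
    """
    idx = [i for i, m in enumerate(masked) if m]
    if not idx:
        return 0.0
    total = 0.0
    for i in idx:
        t, s = teacher_frames[i], student_frames[i]
        total += sum((a - b) ** 2 for a, b in zip(t, s)) / len(t)
    return total / len(idx)

teacher = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
student = [[0.0, 0.0], [2.0, 2.0], [1.0, 3.0]]
masked = [True, False, True]
print(frame_loss(teacher, student, masked))  # 1.5
```

In the toy example the two masked frames contribute errors of 1.0 and 2.0, so the loss is their mean; the unmasked middle frame is ignored even though it matches exactly.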
3.4 Online Distillation
Online distillation is a self-supervised learning strategy for teacher-student learning, where the student network updates its parameters by backpropagation and the teacher network updates its parameters with an exponential moving average (EMA) (Grill et al., 2020). For the student network $\mathcal{S}$, the total loss $L$ for backpropagation is a combination of the frame-level loss $L_{Frm}$ and the utterance-level loss $L_{Utt}$, denoted as:

$$L = L_{Frm} + \alpha L_{Utt}, \tag{9}$$

with a tunable weight $\alpha$. For the teacher network $\mathcal{T}$, the parameters $\theta^{\mathcal{T}}$ are initialized to the parameters $\theta^{\mathcal{S}}$ of the student network $\mathcal{S}$ and are then updated with EMA within each mini-batch, denoted as:

$$\theta^{\mathcal{T}} \leftarrow \tau \theta^{\mathcal{T}} + (1 - \tau) \theta^{\mathcal{S}}, \tag{10}$$

where $\tau$ is a parameter that increases linearly during pre-training. In practice, within each mini-batch the parameters of the teacher feature extractor $\mathcal{F}^{\mathcal{T}}$ are copied directly from $\mathcal{F}^{\mathcal{S}}$, while the parameters of the teacher backbone network $\mathcal{B}^{\mathcal{T}}$ are updated with EMA from $\mathcal{B}^{\mathcal{T}}$ and $\mathcal{B}^{\mathcal{S}}$.
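The teacher update can be sketched as follows. This is a toy illustration under assumptions: parameters are scalars stored in a dictionary, the key prefixes `extractor.` and `backbone.` are hypothetical, the decay values are arbitrary, and the combined loss is assumed to be the frame-level loss plus a weighted utterance-level loss.

```python
def tau_at(step, total_steps, tau_start, tau_end):
    """Linearly increase the EMA decay from tau_start to tau_end."""
    frac = min(step / total_steps, 1.0)
    return tau_start + (tau_end - tau_start) * frac

def update_teacher(teacher, student, tau):
    """Copy feature-extractor parameters, EMA-update everything else."""
    new = {}
    for name, value in teacher.items():
        if name.startswith("extractor."):
            new[name] = student[name]                       # direct copy
        else:
            new[name] = tau * value + (1.0 - tau) * student[name]
    return new

def total_loss(frame_loss_value, utt_loss_value, weight):
    """Combine the two losses with a tunable weight on the utterance term."""
    return frame_loss_value + weight * utt_loss_value

teacher = {"extractor.w": 0.0, "backbone.w": 0.0}
student = {"extractor.w": 1.0, "backbone.w": 1.0}
teacher = update_teacher(teacher, student, tau=tau_at(0, 100, 0.9, 0.99))
print(teacher)
```

With a decay close to 1 the teacher backbone moves slowly toward the student, which is what makes it a stable target; the linear schedule slows it further as training progresses.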
4 Experiments Setup
4.1 Initial Model
Different initial models lead to different architectures of the feature extractor $\mathcal{F}$ and the backbone network $\mathcal{B}$, and to different initialization parameters $\theta$. Here we adopt two models, data2vec (https://dl.fbaipublicfiles.com/fairseq/data2vec/audio_base_ls.pt) and data2vec 2.0 (https://dl.fbaipublicfiles.com/fairseq/data2vec2/base_libri.pt), both of which have the same feature extractor design but different backbone network designs. The feature extractor $\mathcal{F}$ is a 7-layer 1-D convolutional neural network with kernel sizes and strides , resulting in 320x downsampling. Given raw audio input at a 16000 Hz sample rate, the output representations are at 50 Hz with dimension 512. A linear projection from dimension 512 to 768 is then applied, followed by the mask operation to construct the input for the backbone network $\mathcal{B}$. Here we briefly introduce the different backbone networks in data2vec and data2vec 2.0.
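The 320x downsampling and the 50 Hz frame rate follow directly from the product of the convolution strides. The sketch below checks this arithmetic; the kernel sizes and strides used here are an assumption (the configuration commonly used by wav2vec 2.0-style feature extractors), and `total_stride` and `num_frames` are hypothetical helper names.

```python
# Assumed 7-layer configuration; treat these values as an illustration only.
KERNELS = (10, 3, 3, 3, 3, 2, 2)
STRIDES = (5, 2, 2, 2, 2, 2, 2)

def total_stride(strides):
    """Overall downsampling factor of a stack of strided convolutions."""
    factor = 1
    for s in strides:
        factor *= s
    return factor

def num_frames(num_samples, kernels, strides):
    """Output length of unpadded 1-D convolutions applied in sequence."""
    n = num_samples
    for k, s in zip(kernels, strides):
        n = (n - k) // s + 1
    return n

factor = total_stride(STRIDES)
print(factor)                                  # 320
print(16000 / factor)                          # 50.0 frames per second
print(num_frames(16000, KERNELS, STRIDES))     # 49 frames for one second
```

One second of audio yields 49 rather than 50 frames under this assumed configuration because the unpadded convolutions consume a few samples at the boundary.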
data2vec
The backbone network contains a 5-layer learnable convolutional positional encoding followed by a 12-layer standard Transformer. Each Transformer block is set to 768 model dimension, 3072 bottleneck dimension, and 12 attention heads. Finally, a linear projection from 768 to 768 is equipped on the student outputs, the results of which are employed to calculate MLM loss with teacher outputs.
data2vec 2.0
The data2vec 2.0 model shares the same Transformer architecture as data2vec, with one additional CNN decoder. The Transformer encoder encodes only the non-masked parts of the downsampled features $Z$; the masked parts are then filled with random Gaussian noise before being passed to the CNN decoder, in an MAE-style fashion, to improve efficiency. The CNN decoder is a 4-layer 1-D convolutional neural network with all kernel sizes set to 7, strides set to 1, and channels set to 384, without downsampling. A linear projection from 384 to 768 is applied to compute the MLM loss, which works the same way as in data2vec.
4.2 Training Details
Self-supervised Pre-training
In the pre-training phase, we train emotion2vec on the 262 hours of unlabeled emotion data listed in Table 1, with different initial models. For the training overhead, the pre-training is conducted on NVIDIA A10 Tensor Core GPUs, and we simulate GPUs by setting the update frequency to . We train emotion2vec for epochs, each of which takes about minutes. We use a dynamic batch size, where the maximum number of tokens is . For the optimization strategy, we use Adam with a learning rate of and a weight decay of . We train emotion2vec using a cosine learning rate scheduler, with proportion of linear warm-up. For the student model, each time step of the input has a probability of to be the start index, and the subsequent time steps are masked. The hyperparameter that controls the loss weight is set to . For the teacher model, we use the average of the top blocks of the Transformer layer outputs to provide the training targets. We apply a linearly increasing strategy for from to for the exponential moving average of the teacher parameters.
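A cosine learning rate schedule with linear warm-up can be sketched as below. All numeric values in the example (peak rate, step count, warm-up fraction) are placeholders chosen for illustration and are not the settings used for emotion2vec; `lr_at` is a hypothetical helper, and the schedule is assumed to decay to zero.

```python
import math

def lr_at(step, total_steps, peak_lr, warmup_frac):
    """Learning rate at a given step: linear warm-up, then cosine decay."""
    warmup_steps = max(1, round(total_steps * warmup_frac))
    if step < warmup_steps:
        # ramp linearly up to peak_lr over the warm-up steps
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))

# Placeholder settings: 1000 updates, 10% warm-up, peak rate 1e-3.
for step in (0, 99, 100, 550, 1000):
    print(step, lr_at(step, total_steps=1000, peak_lr=1e-3, warmup_frac=0.1))
```

Simulating more GPUs through the update frequency amounts to accumulating gradients over several mini-batches before each of these scheduler steps.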
Supervised Fine-tuning
All model architectures of diverse downstream tasks are designed to be as simple as possible, to demonstrate the representation ability of the pretrained model. For the non-sequential task, following the common practice of SUPERB Yang et al. (2021), we use two linear layers with a ReLU activation function sandwiched between them. For the sequential task, we use two layers of gated recurrent units (GRU) to make predictions.
4.3 Datasets
Dataset | Pretrain | Downstream | Source | Emo | Spk | Lang | #Utts | #Hours |
---|---|---|---|---|---|---|---|---|
IEMOCAP Busso et al. (2008) | ✓ | ✓ | Act | 5 | 10 | English | 5531 | 7.0 |
MELD Poria et al. (2019) | ✓ | ✓ | Friends TV | 7 | 407 | English | 13847 | 12.2 |
CMU-MOSEI Zadeh et al. (2018) | ✓ | ✓ | YouTube | 7 | 1000 | English | 44977 | 91.9 |
MEAD Wang et al. (2020) | ✓ | ✗ | Act | 8 | 60 | English | 31792 | 37.3 |
MSP-Podcast (V1.8) Martinez-Lucas et al. (2020) | ✓ | ✗ | Podcast | 8 | 10000+ | English | 72969 | 113.5 |
Total | ✓ | – | – | – | – | English | 169053 | 262.0 |
CMU-MOSI Zadeh et al. (2016) | ✗ | ✓ | YouTube | 7 | 89 | English | 2199 | 2.6 |
RAVDESS-Speech Livingstone and Russo (2018) | ✗ | ✓ | Act | 8 | 24 | English | 1440 | 1.5 |
RAVDESS-Song Livingstone and Russo (2018) | ✗ | ✓ | Act | 8 | 23 | English | 1012 | 1.3 |
SAVEE Jackson and Haq (2014) | ✗ | ✓ | Act | 7 | 4 | English | 480 | 0.5 |
M3ED Zhao et al. (2022) | ✗ | ✓ | TVs | 7 | 626 | Mandarin | 24449 | 9.8 |
EmoDB Burkhardt et al. (2005) | ✗ | ✓ | Act | 7 | 10 | German | 535 | 0.4 |
EMOVO Costantini et al. (2014) | ✗ | ✓ | Act | 7 | 10 | Italian | 588 | 0.5 |
CaFE Gournay et al. (2018) | ✗ | ✓ | Act | 7 | 12 | French | 936 | 1.2 |
SUBESCO Sultana et al. (2021) | ✗ | ✓ | Act | 7 | 20 | Bangla | 7000 | 7.8 |
ShEMO Mohamad Nezami et al. (2019) | ✗ | ✓ | Act | 6 | 87 | Persian | 3000 | 3.4 |
URDU Latif et al. (2018) | ✗ | ✓ | Talk shows | 4 | 38 | Urdu | 400 | 0.3 |
AESDD Vryzas et al. (2018) | ✗ | ✓ | Act | 5 | 5 | Greek | 604 | 0.7 |
RESD Lubenets et al. | ✗ | ✓ | Act | 7 | 200 | Russian | 1396 | 2.3 |
A summary of the datasets employed in our experiments is presented in Table 1. There are 18 emotional datasets covering 10 different languages: 9 in English, and one each in Mandarin, Bangla, French, German, Greek, Italian, Persian, Russian, and Urdu. Each dataset is characterized in terms of Pretrain (i.e., whether it is used during the pre-training phase), Downstream (i.e., whether it is tested in a downstream task), Source (i.e., where the samples were collected), Emo (i.e., number of emotion categories), Spk (i.e., number of speakers), Lang (i.e., language), #Utts (i.e., number of utterances), and #Hours (i.e., total duration of samples). Speech data is extracted from these datasets and uniformly converted to a single channel at 16 kHz.
In the pretraining phase, we utilize five large-scale English datasets, including IEMOCAP Busso et al. (2008), MELD Poria et al. (2019), MEAD Wang et al. (2020), CMU-MOSEI Zadeh et al. (2018), and MSP-Podcast Martinez-Lucas et al. (2020), resulting in a total of 262 hours. The IEMOCAP corpus contains a total of 5 sessions and 10 different speakers, with each session being a conversation of two exclusive speakers. MELD is a multi-party conversational dataset containing about 13,847 utterances from 1,433 dialogues collected from the TV series ‘Friends’. MEAD is a talking-face video corpus featuring 60 actors and actresses talking with 8 different emotions at three different intensity levels. CMU-MOSEI is a multimodal dataset from YouTube for sentiment and emotion analysis in videos. MSP-Podcast is collected from podcast recordings that discuss a variety of topics like politics, sports, and movies.
Different datasets are used to test different downstream tasks in various languages. For the main results in Section 5.2, we report cross-validation (CV) results on the IEMOCAP dataset. The original labels cover five classes; to be consistent and comparable with previous methods Ye et al. (2023); Chen et al. (2023b), we merge ‘excited’ with ‘happy’ to better balance the size of each emotion class, resulting in four classes. We conduct both leave-one-session-out 5-fold CV and leave-one-speaker-out 10-fold CV. Moreover, we report results on MELD under its original split setup, and on the RAVDESS Livingstone and Russo (2018) and SAVEE Jackson and Haq (2014) datasets under a random leave-one-out 10-fold CV setup, in which, at each fold, all samples in the dataset are randomly split into training, validation, and testing sets of 80%, 10%, and 10%, respectively. Among these, the speech in the RAVDESS and SAVEE datasets is not seen in the pre-training stage, which demonstrates the generalization of the proposed model on out-of-domain corpora.
For the language generalization task in Section 5.3, we report CV results for 9 out-of-domain datasets, one each in Mandarin (M3ED Zhao et al. (2022)), Bangla (SUBESCO Sultana et al. (2021)), French (CaFE Gournay et al. (2018)), German (EmoDB Burkhardt et al. (2005)), Greek (AESDD Vryzas et al. (2018)), Italian (EMOVO Costantini et al. (2014)), Persian (ShEMO Mohamad Nezami et al. (2019)), Russian (RESD Lubenets et al. ), and Urdu (URDU Latif et al. (2018)). Unless a dataset provides its own partition, language generalization results are obtained using the random leave-one-out 10-fold CV described above. For the RESD dataset, for example, we follow its original split of 280 testing samples and 1116 training samples, and additionally hold out 10% of the training samples for validation.
For the task generalization experiments in Section 5.4, we test other speech emotion tasks, including song emotion recognition, emotion prediction in conversation, and sentiment analysis, on RAVDESS-Song Livingstone and Russo (2018), IEMOCAP, and CMU-MOSI Zadeh et al. (2016) & CMU-MOSEI Zadeh et al. (2018), respectively. For song emotion recognition and emotion prediction in conversation, we report CV results. For sentiment analysis, we report results with the original split setup. To be comparable with previous work, the experimental setup varies according to the specific task.
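The two kinds of evaluation protocol mentioned here can be sketched as follows: holding out one group (a session or a speaker) per fold, and a random 80/10/10 split. The helper names, the tuple layout `(utterance_id, session, speaker)`, and the toy data are hypothetical and serve only to show the mechanics.

```python
import random

def leave_one_group_out(items, group_of):
    """Yield (held_out_group, train, test), one fold per distinct group."""
    groups = sorted({group_of(x) for x in items})
    for held_out in groups:
        train = [x for x in items if group_of(x) != held_out]
        test = [x for x in items if group_of(x) == held_out]
        yield held_out, train, test

def random_split(items, seed, percents=(80, 10, 10)):
    """Shuffle and split into train/validation/test by integer percentages."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    n_train = n * percents[0] // 100
    n_val = n * percents[1] // 100
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

# Toy corpus: 100 utterances, 5 sessions with 2 speakers each (10 speakers).
utts = [(f"utt{i}", i % 5 + 1, i % 10) for i in range(100)]

session_folds = list(leave_one_group_out(utts, lambda u: u[1]))   # 5 folds
speaker_folds = list(leave_one_group_out(utts, lambda u: u[2]))   # 10 folds
train, val, test = random_split(utts, seed=0)
print(len(session_folds), len(speaker_folds), len(train), len(val), len(test))
```

For the random protocol, repeating `random_split` with ten different seeds would give ten folds; the group-based protocols guarantee that no speaker (or session) appears in both the training and testing sets of a fold.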
5 Results
5.1 Evaluation Metrics
We apply commonly used evaluation metrics, weighted accuracy (WA), unweighted accuracy (UA), and weighted average F1 (WF1), to evaluate the performance of speech emotion tasks. WA corresponds to the overall accuracy and UA corresponds to the average class-wise accuracy. WF1 is a comprehensive evaluation, especially for the situation of sample imbalance.
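A minimal sketch of the three metrics, written with the standard library only, is given below. The function names and the toy labels are illustrative; the definitions follow the description above (overall accuracy, mean per-class recall, and support-weighted F1).

```python
from collections import Counter

def weighted_accuracy(y_true, y_pred):
    """WA: overall accuracy over all samples."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def unweighted_accuracy(y_true, y_pred):
    """UA: average of per-class accuracies (recalls)."""
    support = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return sum(hits[c] / support[c] for c in support) / len(support)

def weighted_f1(y_true, y_pred):
    """WF1: per-class F1 averaged with class-support weights."""
    support = Counter(y_true)
    total = 0.0
    for c in support:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = support[c] - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += support[c] * f1
    return total / len(y_true)

y_true = ["ang", "ang", "ang", "hap", "sad", "sad"]
y_pred = ["ang", "ang", "hap", "hap", "sad", "ang"]
print(round(weighted_accuracy(y_true, y_pred), 4))    # 0.6667
print(round(unweighted_accuracy(y_true, y_pred), 4))  # 0.7222
print(round(weighted_f1(y_true, y_pred), 4))          # 0.6667
```

In the toy example UA exceeds WA because the minority class "hap" is recognized perfectly, which illustrates why the two can diverge on imbalanced data.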
5.2 Main Results
The results are shown in Table 2, where we compare different SSL pre-trained models on the IEMOCAP dataset, as well as larger-scale pre-trained models and the latest specialist models designed for SER tasks. We follow the evaluation of SUPERB Yang et al. (2021), freezing the pre-trained model and training downstream linear layers with the hidden dimension set to 256. As can be seen from the table, emotion2vec outperforms all existing SSL pre-trained models, including base models with a similar number of parameters and large models with more parameters. Compared with Vesper-12, an SER model obtained by distillation from WavLM-large, emotion2vec works better with fewer parameters. TIM-Net Ye et al. (2023), MSTR Li et al. (2023b), and DST Chen et al. (2023b) are the latest SER specialist models, which use different scales of upstream features and downstream networks. The proposed emotion2vec model outperforms or performs on par with these models using only linear layers, while their downstream networks have 2x, 135x, and 114x as many parameters as that of emotion2vec, respectively. We provide the results of leave-one-session-out five-fold cross-validation and leave-one-speaker-out ten-fold cross-validation for reference.
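The size of such a downstream head is easy to check by hand. The sketch below counts the parameters of two linear layers with a hidden dimension of 256, assuming bias terms, four output classes (the merged IEMOCAP label set), and input feature dimensions of 512, 768, and 1024; these dimensions and the helper names are assumptions for illustration.

```python
def linear_params(d_in, d_out):
    """Weights plus biases of one fully connected layer."""
    return d_in * d_out + d_out

def head_params(feature_dim, hidden_dim=256, num_classes=4):
    """Two linear layers; the ReLU between them has no parameters."""
    return (linear_params(feature_dim, hidden_dim)
            + linear_params(hidden_dim, num_classes))

for dim in (512, 768, 1024):
    count = head_params(dim)
    print(dim, count, f"{count / 1e6:.2f}M")
# 512 132356 0.13M
# 768 197892 0.20M
# 1024 263428 0.26M
```

Under these assumptions the counts round to 0.13M, 0.20M, and 0.26M, which are the downstream sizes that appear in Table 2.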
We also conduct experiments on other mainstream English datasets to prove the generalization of emotion2vec in Table 3. MELD is a noisy dataset used to test the SER performance of the model in complex environments. RAVDESS and SAVEE are out-of-domain datasets with respective recording environments. Experimental results show that emotion2vec exhibits state-of-the-art performance on different datasets in different environments.
Model | Pre-training Corpus | Upstream | #Upstream Params | Downstream | #Downstream Params | WA(%) |
---|---|---|---|---|---|---|
Self-supervised Model | | | | | | |
small size | | | | | | |
wav2vec (Schneider et al., 2019) | LS-960 | Proposed | 32.54M | Linear | 0.13M | 59.79 |
vq-wav2vec (Baevski et al., 2019) | LS-960 | Proposed | 34.15M | Linear | 0.20M | 58.24 |
base size | | | | | | |
wav2vec 2.0 (Baevski et al., 2020) | LS-960 | Proposed | 95.04M | Linear | 0.20M | 63.43 |
HuBERT (Hsu et al., 2021) | LS-960 | Proposed | 94.68M | Linear | 0.20M | 64.92 |
WavLM (Chen et al., 2022) | LS-960 | Proposed | 94.70M | Linear | 0.20M | 65.94 |
WavLM+ (Chen et al., 2022) | Mix-94k | Proposed | 94.70M | Linear | 0.20M | 67.98 |
data2vec (Baevski et al., 2022) | LS-960 | Proposed | 93.75M | Linear | 0.20M | 67.38 |
data2vec 2.0 (Baevski et al., 2023) | LS-960 | Proposed | 93.78M | Linear | 0.20M | 68.58 |
Vesper-4 (Chen et al., 2023a) | Mix-94k + LSED-206 | Proposed | 63.52M | Linear | 0.26M | 68.40 |
Vesper-12 (Chen et al., 2023a) | Mix-94k + LSED-206 | Proposed | 164.29M | Linear | 0.26M | 70.70 |
emotion2vec | LS-960 + Emo-262 | Proposed | 93.79M | Linear | 0.20M | 71.79 |
emotion2vec* | LS-960 + Emo-262 | Proposed | 93.79M | Linear | 0.20M | 74.48 |
emotion2vec | LS-960 + Emo-262 | Proposed | 93.79M | Linear | 0.20M | 72.94 |
emotion2vec* | LS-960 + Emo-262 | Proposed | 93.79M | Linear | 0.20M | 77.64 |
large size | | | | | | |
wav2vec 2.0 (Baevski et al., 2020) | LL-60k | Proposed | 317.38M | Linear | 0.26M | 65.64 |
HuBERT (Hsu et al., 2021) | LL-60k | Proposed | 316.61M | Linear | 0.26M | 67.62 |
WavLM (Chen et al., 2022) | Mix-94k | Proposed | 316.62M | Linear | 0.26M | 70.03 |
Supervised Model | | | | | | |
TIM-Net (Ye et al., 2023) | - | MFCC | - | CNN(TIM-Net) | 0.40M | 68.29 |
MSTR (Li et al., 2023b) | - | HuBERT-large | 316.61M | Transformer(MSTR) | 27.00M | 70.03 |
DST (Chen et al., 2023b) | - | WavLM-large | 316.62M | Transformer(DST) | 22.78M | 71.80 |

Model | WA(%) | UA(%) | WF1(%) | WA(%) | UA(%) | WF1(%) | WA(%) | UA(%) | WF1(%) |
---|---|---|---|---|---|---|---|---|---|
Dataset | MELD | | | RAVDESS | | | SAVEE | | |
WavLM-base | 46.95 | 16.34 | 35.16 | 37.01 | 37.11 | 36.08 | 42.08 | 38.46 | 38.93 |
WavLM-base+ | 43.78 | 16.75 | 34.60 | 38.89 | 38.40 | 37.75 | 43.54 | 39.27 | 42.19 |
data2vec | 45.75 | 24.98 | 43.59 | 69.58 | 69.70 | 69.25 | 82.50 | 82.26 | 82.37 |
data2vec 2.0 | 48.92 | 26.10 | 45.80 | 81.04 | 80.80 | 80.97 | 83.13 | 82.94 | 83.03 |
emotion2vec | 51.88 | 28.03 | 48.70 | 82.43 | 82.86 | 82.39 | 84.38 | 82.30 | 84.45 |

Model | WA(%) | UA(%) | WF1(%) | WA(%) | UA(%) | WF1(%) | WA(%) | UA(%) | WF1(%) |
---|---|---|---|---|---|---|---|---|---|
Dataset | AESDD (Gr) | | | CaFE (Fr) | | | RESD (Ru) | | |
WavLM-base | 55.33 | 55.50 | 54.86 | 31.61 | 32.02 | 30.88 | 56.17 | 56.17 | 55.69 |
WavLM-base+ | 53.83 | 54.41 | 52.48 | 31.40 | 33.39 | 30.40 | 55.00 | 55.19 | 55.08 |
data2vec | 56.67 | 57.26 | 56.57 | 57.10 | 57.68 | 57.36 | 49.42 | 49.77 | 48.97 |
data2vec 2.0 | 71.33 | 70.20 | 70.93 | 71.51 | 72.98 | 71.50 | 64.08 | 64.33 | 64.17 |
emotion2vec | 72.33 | 72.27 | 71.57 | 74.52 | 75.26 | 74.53 | 64.75 | 65.04 | 64.53 |
Dataset | EmoDB (De) | | | EMOVO (It) | | | M3ED (Zh) | | |
WavLM-base | 59.06 | 55.32 | 58.96 | 40.17 | 40.34 | 37.36 | 44.03 | 18.90 | 34.50 |
WavLM-base+ | 65.66 | 64.60 | 64.83 | 40.34 | 41.98 | 40.11 | 45.09 | 20.18 | 36.49 |
data2vec | 67.17 | 64.81 | 66.52 | 51.21 | 51.97 | 49.82 | 44.44 | 21.10 | 37.77 |
data2vec 2.0 | 83.77 | 83.07 | 83.93 | 60.69 | 61.27 | 60.79 | 47.50 | 24.12 | 41.74 |
emotion2vec | 84.34 | 84.85 | 84.32 | 61.21 | 62.97 | 60.89 | 49.15 | 26.98 | 44.38 |
Dataset | SUBESCO (Bn) | | | ShEMO (Fa) | | | URDU (Ur) | | |
WavLM-base | 54.50 | 54.77 | 53.96 | 67.27 | 46.60 | 65.63 | 71.00 | 70.25 | 70.82 |
WavLM-base+ | 54.73 | 54.69 | 54.59 | 66.73 | 44.29 | 65.12 | 67.25 | 68.68 | 67.47 |
data2vec | 78.29 | 78.25 | 78.21 | 70.80 | 53.96 | 69.84 | 71.75 | 72.67 | 71.83 |
data2vec 2.0 | 87.91 | 87.95 | 87.90 | 77.90 | 62.03 | 76.96 | 77.50 | 78.42 | 77.12 |
emotion2vec | 90.91 | 90.96 | 90.91 | 79.97 | 66.04 | 79.56 | 81.50 | 81.87 | 81.60 |

Model | Upstream | Downstream | WA(%) | UA(%) | WF1(%) |
---|---|---|---|---|---|
Self-supervised Model | | | | | |
WavLM-base | Freeze | Linear | 52.3 | 52.4 | 52.1 |
WavLM-base+ | Freeze | Linear | 54.9 | 53.9 | 54.2 |
data2vec | Freeze | Linear | 63.8 | 64.1 | 63.4 |
data2vec 2.0 | Freeze | Linear | 73.0 | 74.6 | 72.7 |
L-Net (Koh and Dubnov, 2021) | Freeze | Linear | 71.0 | - | - |
SpecMAE (Sadok et al., 2023) | Finetune | Linear | 54.5 | - | 53.9 |
VQ-MAE-S (Patch-tf) (Sadok et al., 2023) | Finetune | Linear | 84.0 | - | 84.0 |
VQ-MAE-S (Frame) (Sadok et al., 2023) | Finetune | Linear | 84.2 | - | 84.3 |
emotion2vec | Freeze | Linear | 85.0 | 85.2 | 84.8 |
Specialist Model | | | | | |
VQ-MAE-S (Patch-tf) (Sadok et al., 2023) | Finetune | Query2Emo | 83.7 | - | 83.4 |
VQ-MAE-S (Frame) (Sadok et al., 2023) | Finetune | Query2Emo | 85.8 | - | 85.7 |