Multimodal Emotion Recognition Using Dynamic Ensemble Selection

Anonymous Authors

Abstract

The study of human emotions, traditionally a cornerstone in fields like psychology and neuroscience, has been profoundly impacted by the advent of artificial intelligence (AI). In understanding human emotions, multiple channels such as speech (voice) and facial expressions (image) are crucial. However, AI’s journey in multimodal emotion recognition (MER) is marked by substantial technical challenges. One significant hurdle is how AI models manage the absence of a particular modality, a frequent occurrence in real-world situations. This study assesses the performance of two strategies, and their resilience when one modality is absent: the first is a novel approach based on multimodal ensemble learning and dynamic selection, and the second is a cross-attention mechanism. Results on the RECOLA dataset show that dynamic selection-based methods are a promising approach for multimodal emotion recognition. In most missing-modality scenarios, all dynamic selection-based methods (DS, DW, DWS, and Meta-DW) outperformed the baseline. The study concludes by emphasizing the intricate interplay between audio and video modalities in emotion prediction, showcasing the adaptability of dynamic selection methods in handling missing modalities.

I Introduction

The world we live in is multimodal. In this context, modality refers to how we perceive and interact with our environment. Such a concept is crucial in advancing Artificial Intelligence (AI), as it underlines the AI systems’ need to understand and integrate these different modalities effectively. The study of the complementary relationships between modalities such as visual (video), auditory (audio), textual (text), and sensory signals (e.g., heart rate variability) is essential for developing more sophisticated and context-aware AI systems [Baltrusaitis_Ahuja_Morency_2019].

Most current multimodal approaches assume that the different modalities are always available and carry complementary information. This perspective is crucial in understanding how combining these modalities can lead to a richer and more comprehensive interpretation of underlying data. Each modality, whether visual, audio, textual, or otherwise, is believed to contribute unique insights. When merged, these insights form enhanced representations that are often more informative and accurate than any modality could achieve alone [Praveen_2023].

However, the ideal scenario of having all expected modalities available for a given task is not always the case [DMelloKory2015]. In the context of multimodal AI systems, the absence or unavailability of one or more modalities can significantly impact their performance and decision-making processes. Therefore, addressing missing modalities in multimodal learning is critical because it reflects the challenges of dealing with real-world data that may not always conform to ideal multimodal settings.

This work evaluates two possible strategies to model the interaction of modalities (video+audio) in the context of emotion recognition. The first is a novel approach based on multimodal ensemble learning and dynamic selection across the multimodal space. The second is a well-known strategy employing a neural network with an attention-based mechanism to learn the modalities jointly, inspired by the method proposed in [Praveen_2022]. In other words, we focus on assessing the impact of one modality’s absence on model performance in these distinct multimodal AI approaches. Two research questions guided our experiments. The first concerns the proposed dynamic selection method: (RQ1) “Could dynamic selection of modalities be a promising approach for a multimodal AI method?” The second concerns the impact of a missing modality on each multimodal approach evaluated: (RQ2) “What is the impact on emotion recognition performance when one modality (video or audio) is missing?” To this end, we simulate the loss of a modality by replacing the corresponding features with zeros or the mean feature values.

The contribution of this paper is twofold: i) it introduces a novel multimodal method for emotion recognition based on the dynamic selection of modalities, and ii) it assesses how the different approaches (dynamic selection vs. attention mechanism) respond to the absence of a specific modality. This work not only scrutinizes the challenges faced by AI in multimodal emotion recognition but also explores innovative methodologies, aiming to push the boundaries of what these sophisticated models can achieve in understanding the complex landscape of human emotions.

II Related Works

The field of multimodal emotion recognition has been the subject of extensive research, given its applicability in various domains. This section highlights relevant works addressing the importance of emotion recognition through multiple modalities and dealing with missing modalities during inference.

Multiple modalities in emotion recognition are essential because humans simultaneously express emotions through various channels. Human communication encompasses both verbal and non-verbal expressions of emotions, necessitating our dependence on multimodal cues. When pronounced with emphasis in a deep voice, a simple word can convey a more intense emotion to the listener—a nuance that eludes capture through the analysis of linguistic features alone. Complex emotions, such as fear, find recognition more readily through the interplay of facial expressions and variations in pitch rather than relying solely on the transcripts of the spoken word [Aruna2023].

Relying on a single modality may lead to incomplete or inaccurate assessments of emotional states. In contrast, multimodal models have been shown to enhance accuracy and reliability when compared to unimodal models [Aruna2023], [Liu2023mmfusion]. However, multimodal data is complex and heterogeneous. Some core technical challenges involved with using multimodal data in machine learning are choosing the best data representations, intramodal and cross-modal data correlation, misalignment between elements of different modalities, and fusion strategies [Aruna2023]. Dealing with missing modalities is also one of the most important challenges in multimodal fields [Liu2023mmfusion].

Several factors can contribute to the absence or unavailability of specific modalities in multimodal emotion recognition (MER) approaches at inference time: malfunctioning cameras or microphones that fail to capture specific modalities, sensors disabled or restricted due to privacy concerns, poor lighting conditions that hinder the accurate detection of facial expressions, background noise that affects the capture and interpretation of vocal expressions, and the invasiveness of the sensors needed to capture physiological signals, among others. Understanding these causes is essential for developing strategies to handle missing modalities effectively.

Handling missing modalities in multimodal approaches often involves three strategies to adapt to such situations: (i) imputing or filling in missing modalities using data imputation or leveraging information from other available modalities; (ii) designing models that can gracefully handle scenarios where certain modalities are missing, potentially by learning to rely more on available modalities; (iii) developing models that can adapt to varying modalities or different data distributions, allowing for better generalization when faced with missing modalities.

Several works propose innovative approaches to the absence of modalities during inference. When dealing with missing modalities, common approaches often perform imputations to address the absence of modalities before proceeding with additional computations. Simple imputation methods, such as filling missing values with zeros or mean values, are straightforward but may lead to considerable inaccuracies.

Other noteworthy contributions include the work of Da Silva–Filarder et al. [DaSilvaFilarder2021], which studies multimodal variational autoencoders and states that a critical property of multimodal generative models is to have efficient approaches to work with missing modalities and to enable cross-generation. Cross-generation involves using a subset of input modalities to generate output modalities missing from the input subset. The authors propose two methods: Exhaustive Cross Generation (ECG) and Latent Component Dropout (LCD). LCD is based on randomization and simulates missing modalities by applying a dropout mask to individual elements of the latent variables, one for each modality. It then reverts to a prior expert for those elements selected by the mask. ECG is a brute-force method that considers all possible subsets of modalities.

Li et al. [Li2024] also address the issue of missing modalities. According to the authors, missing data hamper feature extraction from multimodal data, resulting in a decline in model performance and inaccurate results. Therefore, a multiple multi-head attention network based on an encoder with missing modalities (MMAN-M2) is proposed. The multi-head attention represents the modality based on the entire sequence, and the cross-mapping is used to obtain the relationship between the modalities. Random missing modalities are encoded and combined with an optimization module to enhance the association between missing and non-missing modalities. The encoder and decoder module obtain global information and map global information to multiple spaces.

Zhu et al. [Zhu2023] proposed an invariant feature for a missing modality imagination network (IF-MMIN). This method relies on two encoders: the specificity encoder, responsible for extracting high-level features from raw input features, and the invariance encoder, which takes the modality-specific features obtained by the specificity encoder as input to extract modality-invariant features. Additionally, the method incorporates an invariant feature-based imagination module (IF-IM) that predicts the modality-specific features of the missing modality based on the available modality and generates a joint representation.

Vazquez-Rodriguez et al. [Vazquez-Rodriguez2023] proposed a transformer-based architecture for recognizing valence and arousal in a time-continuous manner, even with missing input modalities. A multimodal transformer is used as an encoder to obtain representations from the different modalities, and a transformer decoder is used to process those representations and make predictions. An encoder-decoder attention mechanism (cross-attention) of the transformer decoder is used to weigh the importance of different modalities. The transformer decoder is auto-regressive, considering past predicted values when doing the current inference, which is essential when performing time-continuous predictions.

Other interesting works present computational multimodal frameworks based on the transformer architecture and attention mechanisms for emotion recognition with incomplete data [Lin2023, Cheng2024, Liu2024, Nguyen2023]. Despite the impressive performance when a massive dataset is available, this kind of approach involves more complex operations, making interpretation less intuitive. In addition, attention mechanisms and Transformers can become computationally costly, even when some modalities are absent, due to the attention structure that may involve all pairs of elements in the sequence.

Dynamic selection (DS) can offer specific advantages compared to attention mechanisms and transformer architecture when dealing with missing modalities in multimodal solutions. DS may provide better interpretability, as the choice of specific regressors/classifiers in the absence of modalities can be easily analyzed. Additionally, dynamic selection can be more computationally efficient than the attention structure, which may involve all sequence elements. Finally, DS is less dependent on large datasets for training, as observed in our experiments.

Each approach has advantages and challenges, and the choice should be based on the specific demands of the task. In this work, we will analyze the impact of missing modalities in solutions based on dynamic selection and attention mechanisms in the context of arousal-valence regression. Can more straightforward solutions based on the imputation methods, filling missing values with zeros and mean values, effectively address the problem of absent modalities?

III Proposed Approach

The experiment approach was structured in three distinct phases. The first phase involved training the model using a fusion of audio and video modalities. In the second phase, the audio modality was disabled, and the relative contribution of the video modality to the regression task was assessed. Finally, in the third phase, the video modality was disabled, allowing for the evaluation of the audio modality’s relative contribution to the regression task.

This approach was inspired by sensitivity analysis methods used in the works of [NaikKiran], [featureSelectionAlceu], and [VERIKAS2002]. In these studies, sensitivity analysis was conducted at the feature level, examining the impact of subsets of features on the overall performance of a machine learning model. In the present work, however, the sensitivity analysis was performed by generating a zero feature vector or an average feature vector for the modality intended to be disabled. Subsequently, the model was tested with a fusion of this feature vector from the disabled modality and the active feature vector from the other modality.

This approach provided us with valuable insights into how each modality independently influences the model’s performance and how the strategies employed by Dynamic Selection and Cross-Attention handle the absence of specific modalities. By comparing the results from each phase, one could discern the individual and combined effects of audio and video modalities. Moreover, this analysis sheds light on the sensitivity of the proposed methods when confronted with missing modalities.

III-A Dynamic Selection

In this phase, we employ features from AVEC’16 [AVEC2016], encompassing audio and video modalities. As shown in Figure 1, the audio features include acoustic features, MFCCs, and Mel spectrograms. On the video side, we incorporate appearance features and geometric features. Each regressor is trained independently, resulting in a pool of regressors denoted as $F=\{f_{1},f_{2},\dots,f_{N}\}$, where $N$ is the total number of regressors.

Two LSTM layers with 256 cells each are employed to consider the temporal structure of the data. For the recurrent layers, the input is segmented into sequences of 6 seconds, corresponding to 150 time steps (frames) at a sampling rate of 16 kHz.
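The segmentation step can be illustrated with a minimal NumPy sketch. It assumes non-overlapping windows and drops trailing frames that do not fill a whole window; neither choice is stated in the text, and the function name `segment_sequence` and the toy feature matrix are hypothetical.

```python
import numpy as np

def segment_sequence(features, steps=150):
    """Split a (T, D) feature matrix into non-overlapping (steps, D) windows.

    Assumption: trailing frames that do not fill a whole window are dropped.
    """
    features = np.asarray(features)
    n_windows = features.shape[0] // steps
    trimmed = features[: n_windows * steps]
    return trimmed.reshape(n_windows, steps, features.shape[1])

# Hypothetical recording: 400 frames of 8-dimensional features.
recording = np.arange(400 * 8, dtype=float).reshape(400, 8)
windows = segment_sequence(recording)
print(windows.shape)  # (2, 150, 8)
```

Each window of 150 time steps would then be one input sequence for the recurrent layers.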

The models are evaluated in the dynamic ensemble selection (DES) phase. Each model receives a weight for each test case $x_{j}$ based on its performance in the competence region $\Psi$, the set composed of the $K$ nearest neighbors of $x_{j}$ in the validation set. DES either selects the regressor with the smallest accumulated error in the competence region or combines all the regressors, or a subset of them, using weighted averaging according to the calculated weight $\alpha_{i}$ of each regressor $f_{i}$.

Figure 1: Dynamic selection approach based on three steps: multimodal feature extraction, training, and testing. The audio features include acoustic features, MFCCs, and Mel spectrograms. The video features include appearance features and geometric features. All regressors are trained separately, and a pool of regressors is obtained. The models are evaluated in the dynamic ensemble selection (DES) phase, where each model receives a weight to evaluate each test case according to its assertiveness in the competence zone. The figure also illustrates the Dynamic Weighting Selection (DWS) calculation.

The impact of the absence of a modality was assessed using the traditional dynamic selection techniques proposed in [Moura_Cavalcanti_Oliveira_2021], adapted for the multimodal problem: Dynamic Selection (DS), Dynamic Weighting (DW), and Dynamic Weighting Selection (DWS). Additionally, we introduce Meta Dynamic Weighting (Meta-DW), a dynamic selection method that does not require computing the competence region $\Psi$. We further compute the simple average of the regressors’ outputs for comparison.

Dynamic Selection (DS): Select the regressor with the smallest accumulated error in the competence region.

Dynamic Weighting (DW): Combines all regressors in the set using weighted averaging. For each test pattern $x_{j}$, its competence region $\Psi$ is calculated. For each item in $\Psi$, a weight $d_{k}$ is calculated using Eq. 1, where $dist_{k}$ is a distance measure between the item in the competence region $t_{k}\in\Psi$ and the test pattern $x_{j}$. The vector $d_{1},d_{2},\dots,d_{K}$ is used to calculate the weight $\alpha_{i}$ of regressor $f_{i}$ using Eq. 2, where $N$ is the size of the regressor pool, $k$ represents the neighbor index, and $sqe_{k,i}$ is the squared error of regressor $i$ calculated using the item $t_{k}\in\Psi$.

Dynamic Weighting Selection (DWS): Combines a subset of regressors; regressors with an average error greater than a threshold are discarded. The method for calculating the weights of the regressors and the strategy for combining the models are the same as the DW algorithm (Eqs. 1 and 2).

$$d_{k}=\frac{\dfrac{1}{dist_{k}}}{\displaystyle\sum_{j=1}^{K}\dfrac{1}{dist_{j}}} \tag{1}$$

$$\alpha_{i}=\frac{\dfrac{1}{\displaystyle\sum_{k=1}^{K}(d_{k}\cdot sqe_{k,i})}}{\displaystyle\sum_{n=1}^{N}\dfrac{1}{\displaystyle\sum_{k=1}^{K}(d_{k}\cdot sqe_{k,n})}} \tag{2}$$

Meta Dynamic Weighting (Meta-DW): Indirectly performs the dynamic weighting selection by training a meta-classifier on the concatenated outputs of the regressors (a vector of dimension $N$). Training is conducted on the validation set, and the target output is the regressor with the best CCC, that is, the regressor whose predictions are closest to the ground-truth labels.
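To make the DS, DW, and DWS rules concrete, the following is a minimal NumPy sketch of the competence region and the weights of Eqs. 1 and 2. Several details are assumptions rather than the authors' implementation: the Euclidean distance, the small `eps` guarding against division by zero, the sum of squared errors as the "accumulated error", and the default DWS threshold (the mean of the per-regressor average errors). All function names and the toy data are hypothetical.

```python
import numpy as np

def competence_region(x, val_X, k):
    """Indices and distances of the k validation items closest to x (Euclidean, an assumption)."""
    dist = np.linalg.norm(val_X - x, axis=1)
    idx = np.argsort(dist)[:k]
    return idx, dist[idx]

def dw_weights(dist, sq_err, eps=1e-12):
    """Eqs. 1 and 2: dist is (K,), sq_err is (K, N); returns the (N,) weights alpha."""
    inv = 1.0 / (dist + eps)            # eps avoids division by zero (assumption)
    d = inv / inv.sum()                 # Eq. 1
    inv_err = 1.0 / (d @ sq_err + eps)  # inverse of the distance-weighted squared error
    return inv_err / inv_err.sum()      # Eq. 2, normalized over the pool

def combine(preds, dist, sq_err, method="DW", threshold=None):
    """Fuse the (N,) outputs of the pool for one test pattern."""
    if method == "DS":
        # Select the regressor with the smallest accumulated error in the region.
        return float(preds[np.argmin(sq_err.sum(axis=0))])
    keep = np.ones(len(preds), dtype=bool)
    if method == "DWS":
        mean_err = sq_err.mean(axis=0)
        # Hypothetical default threshold: mean of the per-regressor average errors.
        thr = mean_err.mean() if threshold is None else threshold
        keep = mean_err <= thr
    alpha = dw_weights(dist, sq_err[:, keep])
    return float(alpha @ preds[keep])

# Toy data: 50 validation items, N = 3 regressors with increasing noise.
rng = np.random.default_rng(0)
val_X = rng.normal(size=(50, 4))
val_y = rng.normal(size=50)
val_pred = val_y[:, None] + rng.normal(size=(50, 3)) * np.array([0.1, 0.5, 1.0])
x = rng.normal(size=4)
test_pred = np.array([0.2, 0.4, 0.9])

idx, dist = competence_region(x, val_X, k=10)
sq_err = (val_pred[idx] - val_y[idx, None]) ** 2
for method in ("DS", "DW", "DWS"):
    print(method, combine(test_pred, dist, sq_err, method))
```

Because the weights are inversely proportional to the distance-weighted error, a regressor that is accurate near the test pattern dominates the weighted average.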

It is important to emphasize that tests were conducted with the standard dynamic selection methods (DS, DW, and DWS) with $K$ varying from $5$ to $150$. All results presented in this work use $K=100$. The Meta-DW method behaved similarly to the results obtained with large values of $K$ (above $30$). Unlike traditional techniques, this meta-classifier approach avoids the computational cost of calculating distances between feature sets and the need for numerous tests with varying $K$.
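A rough sketch of the Meta-DW idea follows. The text does not specify the meta-classifier or how its output is turned into a prediction, so two stand-ins are assumed here: a softmax regression trained by gradient descent, and the use of its class probabilities as weights over the regressors' outputs. The per-item target (the regressor closest to the label) is also a simplification of the "best CCC" criterion. All names and the toy data are hypothetical.

```python
import numpy as np

def best_regressor_labels(val_pred, val_y):
    """Per validation item, index of the regressor whose output is closest to the label."""
    return np.argmin(np.abs(val_pred - val_y[:, None]), axis=1)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def train_meta(val_pred, labels, epochs=500, lr=0.1):
    """Softmax regression (stand-in meta-classifier) from N outputs to the best index."""
    n, n_reg = val_pred.shape
    X = np.hstack([val_pred, np.ones((n, 1))])  # append a bias column
    W = np.zeros((n_reg + 1, n_reg))
    Y = np.eye(n_reg)[labels]                   # one-hot targets
    for _ in range(epochs):
        W -= lr * X.T @ (softmax(X @ W) - Y) / n
    return W

def meta_dw_predict(W, preds):
    """Weight each row of regressor outputs by the predicted class probabilities (assumption)."""
    X = np.hstack([preds, np.ones((preds.shape[0], 1))])
    alpha = softmax(X @ W)
    return (alpha * preds).sum(axis=1)

# Toy validation set: N = 3 regressors with increasing noise.
rng = np.random.default_rng(0)
val_y = rng.uniform(-1, 1, size=200)
val_pred = val_y[:, None] + rng.normal(size=(200, 3)) * np.array([0.1, 0.3, 0.6])
labels = best_regressor_labels(val_pred, val_y)
W = train_meta(val_pred, labels)
fused = meta_dw_predict(W, val_pred[:5])
print(fused)
```

No distance computation or choice of $K$ appears anywhere in this sketch, which is the practical advantage the paragraph above describes.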

III-B Cross-Attention Architecture

The current work utilizes the cross-attention architecture proposed by [Praveen_2022]. In general terms, the cross-attention model is set to receive two data sequences from audio and video modalities, which are combined and processed to generate a single prediction of arousal or valence.

Let the feature vector sets $X_{a}$ and $X_{v}$ be extracted from the audio (A) and video (V) modalities from a fixed-size subsequence $S$, where $X_{a}=\{x_{a}^{1},x_{a}^{2},\ldots,x_{a}^{L}\}\in\mathbb{R}^{d_{a}\times L}$ and $X_{v}=\{x_{v}^{1},x_{v}^{2},\ldots,x_{v}^{L}\}\in\mathbb{R}^{d_{v}\times L}$. Here, $L$ denotes the number of non-overlapping clips taken uniformly from $S$. In turn, $d_{a}$ and $d_{v}$ represent the dimensions of the audio and video features, respectively, where $x_{a}^{l}$ and $x_{v}^{l}$ are audio and video vectors for $l=1,2,\ldots,L$ clips.

The joint representation of audio and video features (J) is obtained from the concatenation of audio and video feature vectors $\textbf{J}=[\textbf{X}_{\textbf{a}};\textbf{X}_{\textbf{v}}]\in\mathbb{R}^{d\times L}$, where $d=d_{a}+d_{v}$ denotes the dimension of the concatenated features. The joint representation J of a subsequence S is used to focus attention on the unimodal representations $\textbf{X}_{\textbf{a}}$ and $\textbf{X}_{\textbf{v}}$. In this regard, the joint correlation matrix for the audio features $\textbf{C}_{\textbf{a}}$ between the audio features $\textbf{X}_{\textbf{a}}$ and the representation J is given by the expression:

$$\textbf{C}_{a}=\tanh\left(\frac{\textbf{X}_{a}^{T}\textbf{W}_{ja}\textbf{J}}{\sqrt{d}}\right) \tag{3}$$

where $\textbf{W}_{ja}$ represents the trainable weight matrix of dimension $L\times L$. Similarly, the joint correlation matrix for the video features ($\textbf{C}_{\textbf{v}}$) is given by the expression:

$$\textbf{C}_{v}=\tanh\left(\frac{\textbf{X}_{v}^{T}\textbf{W}_{jv}\textbf{J}}{\sqrt{d}}\right) \tag{4}$$

The matrices $C_{a}$ and $C_{v}$ represent the joint correlation between the input vectors $\textbf{X}_{a}$ and $\textbf{X}_{v}$ and the joint representation J. It is important to highlight that the correlation matrices $C_{a}$ and $C_{v}$ have a semantic meaning where higher values of $C_{a}$ and $C_{v}$ imply high correlation between the audio and video modalities and within each modality.

In turn, the audio modality $\textbf{X}_{a}$ and video modality $\textbf{X}_{v}$ are combined with the joint correlation matrices $C_{a}$ and $C_{v}$ to compute the attention maps $H_{a}$ and $H_{v}$ , respectively:

$$\textbf{H}_{a}=\text{ReLU}\left(\textbf{W}_{a}\textbf{X}_{a}+\textbf{W}_{ca}\textbf{C}_{a}^{T}\right) \tag{5}$$

$$\textbf{H}_{v}=\text{ReLU}\left(\textbf{W}_{v}\textbf{X}_{v}+\textbf{W}_{cv}\textbf{C}_{v}^{T}\right) \tag{6}$$

The attention maps are used to capture attention in each modality according to the following expressions:

$$\textbf{X}_{\text{att},a}=\textbf{W}_{ha}\textbf{H}_{a}+\textbf{X}_{a} \tag{7}$$

$$\textbf{X}_{\text{att},v}=\textbf{W}_{hv}\textbf{H}_{v}+\textbf{X}_{v} \tag{8}$$

Finally, the attention matrices $\textbf{X}_{\text{att},a}$ and $\textbf{X}_{\text{att},v}$ are concatenated to obtain the attention matrix given by:

$$\textbf{X}_{\text{att}}=\left[\textbf{X}_{\text{att},a};\textbf{X}_{\text{att},v}\right] \tag{9}$$

which is fed into a densely connected layer that will predict values of arousal or valence.
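The forward pass of Eqs. 3 to 9 can be traced with a small NumPy sketch. It is not the implementation of [Praveen_2022]: the weights are random rather than trained, the hidden size `k` is hypothetical, and the weight shapes are assumptions chosen so that every matrix product is defined (in particular, $\textbf{W}_{ja}$ is taken as $d_{a}\times d$ here rather than $L\times L$).

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def joint_cross_attention(X_a, X_v, params):
    """Forward pass of Eqs. 3-9 for one subsequence.

    X_a: (d_a, L) audio features, X_v: (d_v, L) video features.
    """
    d = X_a.shape[0] + X_v.shape[0]
    J = np.vstack([X_a, X_v])                                   # joint representation, (d, L)
    C_a = np.tanh(X_a.T @ params["W_ja"] @ J / np.sqrt(d))      # Eq. 3, (L, L)
    C_v = np.tanh(X_v.T @ params["W_jv"] @ J / np.sqrt(d))      # Eq. 4, (L, L)
    H_a = relu(params["W_a"] @ X_a + params["W_ca"] @ C_a.T)    # Eq. 5, (k, L)
    H_v = relu(params["W_v"] @ X_v + params["W_cv"] @ C_v.T)    # Eq. 6, (k, L)
    X_att_a = params["W_ha"] @ H_a + X_a                        # Eq. 7, (d_a, L)
    X_att_v = params["W_hv"] @ H_v + X_v                        # Eq. 8, (d_v, L)
    return np.vstack([X_att_a, X_att_v])                        # Eq. 9, (d, L)

def init_params(d_a, d_v, L, k, rng):
    """Random weights with assumed shapes; a trained model would learn these."""
    d = d_a + d_v
    shapes = {
        "W_ja": (d_a, d), "W_jv": (d_v, d),
        "W_a": (k, d_a), "W_v": (k, d_v),
        "W_ca": (k, L), "W_cv": (k, L),
        "W_ha": (d_a, k), "W_hv": (d_v, k),
    }
    return {name: rng.normal(scale=0.1, size=shape) for name, shape in shapes.items()}

rng = np.random.default_rng(0)
d_a, d_v, L, k = 6, 4, 5, 3
params = init_params(d_a, d_v, L, k, rng)
X_a = rng.normal(size=(d_a, L))
X_v = rng.normal(size=(d_v, L))
X_att = joint_cross_attention(X_a, X_v, params)
print(X_att.shape)  # (10, 5)
```

The residual terms in Eqs. 7 and 8 mean that the attended features reduce to the original features when the attention weights $\textbf{W}_{ha}$ and $\textbf{W}_{hv}$ are zero.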

IV Experiments

IV-A Database

The remote collaborative and affective interactions (RECOLA) dataset [recola] represents an extensive source of multimodal data, encompassing extracted features and raw data from various modalities, including audio, video, and physiological recordings (electrocardiogram and electrodermal activity). The labeling of the first five minutes of interaction for 18 participants is available.

The data is labeled within the repository, adhering to a continuous emotional scale. This labeling is mapped into a two-dimensional space, a psychologically grounded method for describing emotions through the linear combination of arousal and valence. The concept of representing emotions in arousal and valence follows the circumplex model proposed by Russell [russell1980circumplex].

The official metric for evaluating performance on this task is the concordance correlation coefficient (CCC) [AVEC2016]. The CCC captures the co-variation relationship between predictions and ground truth and accounts for any deviation. As a result, it offers a more accurate representation of the alignment between predictions and ground truth [e25101440]. Higher CCC values indicate better performance in terms of consistency and accuracy. The CCC is calculated as follows:

$$CCC=\frac{2\rho\,\sigma_{y}\,\sigma_{\hat{y}}}{\sigma_{y}^{2}+\sigma_{\hat{y}}^{2}+(\mu_{y}-\mu_{\hat{y}})^{2}} \tag{10}$$

where $\rho$ is the Pearson correlation coefficient, $\sigma_{y}$ and $\sigma_{\hat{y}}$ are the standard deviations, and $\mu_{y}$ and $\mu_{\hat{y}}$ are the means of the actual and predicted emotional states, respectively.
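As an illustration of Eq. 10, a minimal NumPy sketch of the CCC is given below. It uses population statistics (`ddof=0`) throughout, which is an assumption, and relies on the identity that the covariance equals $\rho\,\sigma_{y}\,\sigma_{\hat{y}}$. The function name `ccc` is hypothetical.

```python
import numpy as np

def ccc(y_true, y_pred):
    """Concordance correlation coefficient between two 1-D sequences (Eq. 10)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mu_t, mu_p = y_true.mean(), y_pred.mean()
    var_t, var_p = y_true.var(), y_pred.var()
    # The covariance equals rho * sigma_y * sigma_yhat.
    cov = ((y_true - mu_t) * (y_pred - mu_p)).mean()
    return 2.0 * cov / (var_t + var_p + (mu_t - mu_p) ** 2)

print(ccc([1, 2, 3], [1, 2, 3]))  # 1.0
print(ccc([1, 2, 3], [2, 3, 4]))  # 4/7, about 0.571
```

The second call shows why the CCC is stricter than the Pearson correlation: a constant offset of 1 leaves $\rho=1$ but lowers the CCC to $4/7$ through the $(\mu_{y}-\mu_{\hat{y}})^{2}$ term.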

This experiment emphasized two primary modalities: (i) audio and (ii) video. For the audio component, the eGeMAPS acoustic feature set was employed, which was extracted using the openSMILE software and is available within the RECOLA dataset. Additionally, feature sets based on mel-frequency cepstral coefficients (MFCCs) and Mel spectrograms, both of which were extracted by the authors of this work, were utilized. The video component, on the other hand, relies solely on the features provided with the RECOLA dataset, including geometric features derived from 49 distinct facial landmarks and appearance features obtained by a principal component analysis (PCA) from 50,000 LGBP-TOP features.

IV-B Experimental Protocol

An experimental protocol based on the k-fold cross-validation method has been implemented to ensure the robustness and reliability of our findings.

The dataset comprised data from 18 individuals. To balance training and testing and ensure that our model was tested on unseen data, we allocated three individuals for testing and three for validation. The remaining participants were used for training. The experimental setup was repeated ten times, each time with a different configuration, to enhance the generalization of our results. In each iteration of the experiment, the participants were randomly shuffled.
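The participant-level split described above can be sketched as follows. The sketch follows the random shuffling described in this paragraph; the seed and the function name `subject_splits` are hypothetical, and participants are represented only by integer indices.

```python
import numpy as np

def subject_splits(n_subjects=18, n_test=3, n_val=3, n_repeats=10, seed=0):
    """Yield (train, val, test) arrays of participant indices for each repetition."""
    rng = np.random.default_rng(seed)  # hypothetical seed, for reproducibility only
    for _ in range(n_repeats):
        order = rng.permutation(n_subjects)  # shuffle participants
        test = order[:n_test]
        val = order[n_test:n_test + n_val]
        train = order[n_test + n_val:]
        yield train, val, test

splits = list(subject_splits())
train, val, test = splits[0]
print(len(splits), len(train), len(val), len(test))  # 10 12 3 3
```

Splitting by participant rather than by frame keeps all data from a test participant out of training and validation.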

We deliberately introduced a modality-absent condition to simulate a real-world scenario. These simulations are essential for assessing the robustness and adaptability of our model under less-than-ideal conditions. To simulate the absence of a modality, we used either a zero vector or the average feature vector computed over all instances in the training set. In practical terms, for any instance where a particular modality was supposed to be missing, its feature values were replaced with the corresponding vector. This approach mimics scenarios where a modality’s data is entirely unavailable, allowing us to observe how the model performs when deprived of information from one of the modalities.
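The two replacement schemes can be sketched in a few lines of NumPy. The function name `disable_modality`, the feature dimensions, and the concatenation used to fuse the modalities are illustrative assumptions; only the zero and training-mean replacements come from the text.

```python
import numpy as np

def disable_modality(X, train_X, mode="zero"):
    """Replace every row of X with a constant vector: zeros or the training-set mean."""
    if mode == "zero":
        fill = np.zeros(X.shape[1])
    elif mode == "mean":
        fill = train_X.mean(axis=0)  # average feature vector over training instances
    else:
        raise ValueError("mode must be 'zero' or 'mean'")
    return np.tile(fill, (X.shape[0], 1))

# Toy features: 3-dimensional audio and 2-dimensional video.
rng = np.random.default_rng(0)
audio_train, video_train = rng.normal(size=(20, 3)), rng.normal(size=(20, 2))
audio_test, video_test = rng.normal(size=(5, 3)), rng.normal(size=(5, 2))

# Video disabled: audio stays active, video is replaced.
fused_zero = np.hstack([audio_test, disable_modality(video_test, video_train, "zero")])
fused_mean = np.hstack([audio_test, disable_modality(video_test, video_train, "mean")])
print(fused_zero.shape, fused_mean.shape)  # (5, 5) (5, 5)
```

In both cases the disabled modality carries no instance-specific information, so any remaining predictive power must come from the active modality.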

IV-C Results

This section offers an in-depth analysis of the outcomes achieved by the different DS techniques and by cross-attention under modality-absent conditions.

Table I displays the arousal and valence results, in terms of CCC, for the pool of regressors $F$. The findings reveal that the audio modality better represents the arousal space, with acoustic features, MFCCs, and Mel spectrograms achieving CCC values of $0.69$, $0.64$, and $0.68$, respectively. Conversely, valence is more accurately represented by the video modality, with appearance and geometric features achieving CCC values of $0.48$ and $0.56$, respectively.

Figure 2 shows the arousal prediction of all models for the same test case, considering scenarios where both modalities are available and when each modality is individually unavailable. Under ideal conditions, all models exhibit a consistent pattern with similar predictions. However, at certain moments, one model aligns more closely with the gold standard, while another performs better at other times. When a modality is absent, a noticeable decline in performance is observed among models relying on representations of that particular modality.

For arousal, under ideal conditions with both modalities (video and audio) available, the highest performance, $CCC=0.72$, was observed when employing DW, DWS, and a simple mean of all regressors’ outputs. DS yielded the best outcome in the valence dimension with $CCC=0.48$, surpassing DW, DWS, and the mean, which all registered $CCC=0.46$. Detailed results are shown in Table II.

Regarding arousal, methods based on dynamic selection (DW and DWS, $CCC=0.72$ ) outperformed the top-performing regressor alone, relying solely on acoustic features, $CCC=0.69$ . It shows that dynamic selection of modalities can be a promising approach for a multimodal AI method. Valence achieved its peak performance with geometric features, $CCC=0.56$ , and none of the proposed methods managed to surpass this benchmark in valence prediction.

IV-D Impact of missing modalities

This paper aims to show how the different approaches (Dynamic Selection and Cross-Attention) respond to the absence of a specific modality. Tables III and IV display the arousal and valence results, in terms of CCC, of all comparison methods - encompassing mean of the regressors’ outputs, dynamic selection (DS, DW, DWS, Meta-DW), and cross-attention. Figure 3 compares the gold standard, prediction with all active modalities, and prediction with the absence of each modality.

The scenario where a zero vector represents the absence of a modality yielded slightly superior results compared to representing the missing modality with the average feature vector across all instances in the training set. This occurs because the disparity in the absent modality becomes more apparent, compelling the methods to assign greater significance to the remaining modalities. When using the average feature vector from the training set, it may be unclear…

In the context of arousal, when utilizing a zero vector to denote the absence of a modality, the cross-attention method demonstrated heightened robustness in terms of sensitivity, exhibiting a $6.42\%$ increase in CCC when the video modality was absent. In contrast, the remaining methods either sustained their performance or experienced some loss. Among the dynamic selection-based methods, DS exhibited no performance decline, while Meta-DW showed a minimal decrease of $1.45\%$. The mean method showed the most substantial decline, a loss of $15.28\%$.

More pronounced performance losses were observed when the audio modality was absent. Several approaches witnessed a decline of over $50\%$ in performance, which is understandable as the audio modality most effectively represents the arousal space. DS emerged as the most robust approach in scenarios without audio, experiencing a performance decline of $35.82\%$ compared to the scenario with all available modalities. Following closely, Meta-DW demonstrated a CCC decline of $40.58\%$ .
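The percentages quoted for the dynamic selection methods and the mean baseline appear to be relative changes with respect to the CCC obtained with both modalities available. Under that assumption, a short sketch reproduces several of them from the rounded arousal values in Table III (zero-vector scenario); the function name `relative_change` is hypothetical.

```python
def relative_change(ccc_all, ccc_missing):
    """Relative change in CCC (in percent) with respect to the all-modalities case."""
    return 100.0 * (ccc_missing - ccc_all) / ccc_all

# Rounded arousal CCC values from Table III, zero-vector scenario.
print(round(relative_change(0.72, 0.61), 2))  # Mean, video disabled: -15.28
print(round(relative_change(0.69, 0.68), 2))  # Meta-DW, video disabled: -1.45
print(round(relative_change(0.67, 0.43), 2))  # DS, audio disabled: -35.82
print(round(relative_change(0.69, 0.41), 2))  # Meta-DW, audio disabled: -40.58
```

Values computed from the rounded table entries may differ slightly from figures derived from unrounded results.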

Similarly, still in the context of arousal, when the mean vector represents the absent modality and the video modality is unavailable, the cross-attention method proves to be the most robust in terms of sensitivity to missing modalities, with a $6.42\%$ increase in CCC. Among the dynamic selection-based methods, DS and Meta-DW exhibited no performance decline. When the audio modality was absent, several approaches witnessed a decline of over $50\%$ in performance. DS and Meta-DW emerged as the most robust approaches in this scenario, experiencing performance declines of $35.82\%$ and $40.58\%$, respectively.

A contrasting pattern was observed in valence, where disabling the audio modality yields superior results. When the video modality is absent, DS and DWS prove to be the most robust methods. DS experiences a performance decline of $31.25\%$ in both scenarios of representing the missing modality (through a zero vector or the mean feature vector). On the other hand, DW exhibits a drop of $39.13\%$ when the modality absence is represented by a zero vector and a more significant decrease of $43.48\%$ when the mean feature vector represents the modality absence. In the scenario where the audio modality is missing, the cross-attention method, DS, and Meta-DW emerge as the least sensitive approaches, showing a $2.44\%$, $2.08\%$, and $11.36\%$ decrease in CCC, respectively.

Considering the scenarios of missing modalities, in arousal space, all dynamic selection-based methods (DS, DW, DWS, and Meta-DW) consistently outperformed the baseline, which simply involves calculating the mean of all regressors’ output. For valence, when the audio modality was absent, DS, DW, DWS, and Meta-DW achieved superior results compared to the mean baseline. In the scenario where video was unavailable, Meta-DW failed to surpass the baseline, while DS, DW, and DWS demonstrated a slightly enhanced performance compared to the mean baseline.

TABLE I: CCC for arousal and valence of models based on acoustic features, MFCCs, Mel spectrograms, appearance features, and geometric features. Models were trained with two layers of LSTM with 256 cells and a time window of 6 seconds.

Features	Arousal	Valence
Acoustic	0.69±0.06	0.18±0.07
MFCCs	0.64±0.06	0.35±0.08
Mel Spectrograms	0.68±0.06	0.22±0.09
Appearance	0.42±0.09	0.48±0.06
Geometric	0.41±0.09	0.56±0.14

TABLE II: CCC for arousal and valence of the mean of the regressors’ outputs, dynamic selection (DS, DW, DWS, Meta-DW), and cross-attention-based methods. The DS, DW, and DWS results were generated with K = 100.

Approach	Arousal	Valence
DS	0.67 ± 0.06	0.48 ± 0.10
DW	0.72 ± 0.04	0.46 ± 0.08
DWS	0.72 ± 0.04	0.46 ± 0.08
Meta-DW	0.69 ± 0.07	0.44 ± 0.11
Cross-Attention	0.46 ± 0.13	0.41 ± 0.17
Mean	0.72 ± 0.04	0.46 ± 0.08

TABLE III: Arousal results, in terms of CCC, for the mean of the regressors’ outputs, dynamic selection (DS, DW, DWS, Meta-DW), and cross-attention-based methods. Results are reported for three scenarios: audio and video available, audio disabled, and video disabled. The absent modality is simulated with either a zero vector or the mean feature vector.

Modalities	Mean	DS	DW	DWS	Meta-DW	Cross-Attention
Audio and video available	0.72±0.04	0.67±0.06	0.72±0.04	0.72±0.04	0.69±0.07	0.46±0.13
Video disabled*	0.61±0.09	0.67±0.06	0.64±0.09	0.68±0.07	0.68±0.07	0.49±0.12
Audio disabled*	0.31±0.05	0.43±0.10	0.32±0.05	0.35±0.05	0.41±0.10	0.23±0.12
Video disabled**	0.57±0.05	0.67±0.06	0.59±0.05	0.64±0.05	0.69±0.08	0.49±0.12
Audio disabled**	0.27±0.04	0.43±0.08	0.28±0.04	0.30±0.04	0.41±0.10	0.23±0.12

* i-th modality disabled/zero vector
** i-th modality disabled/feature mean vector

TABLE IV: Valence results, in terms of CCC, for the mean of the regressors’ outputs, dynamic selection (DS, DW, DWS, Meta-DW), and cross-attention-based methods. Results are reported for three scenarios: audio and video available, audio disabled, and video disabled. The absent modality is simulated with either a zero vector or the mean feature vector.

Modalities	Mean	DS	DW	DWS	Meta-DW	Cross-Attention
Audio and video available	0.46±0.08	0.48±0.10	0.46±0.08	0.46±0.08	0.44±0.11	0.41±0.17
Video disabled*	0.27±0.08	0.33±0.09	0.27±0.08	0.28±0.08	0.24±0.06	0.13±0.15
Audio disabled*	0.32±0.06	0.47±0.08	0.33±0.06	0.35±0.07	0.49±0.12	0.40±0.14
Video disabled**	0.25±0.07	0.33±0.09	0.26±0.07	0.26±0.07	0.22±0.09	0.13±0.13
Audio disabled**	0.31±0.07	0.47±0.08	0.32±0.08	0.33±0.08	0.49±0.10	0.40±0.18

* i-th modality disabled/zero vector
** i-th modality disabled/feature mean vector
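
The two ways of simulating an absent modality in Tables III and IV can be sketched as follows. This is only an illustration under stated assumptions: the text does not say over which data the mean feature vector is computed, so the sketch assumes it is taken over training features, and the names `feature_mean` and `disable_modality` are hypothetical.

```python
from statistics import fmean


def feature_mean(frames):
    """Per-dimension mean over a list of feature vectors."""
    return [fmean(column) for column in zip(*frames)]


def disable_modality(frames, strategy, reference_mean=None):
    """Replace every frame of one modality by a placeholder vector.

    strategy: "zero" uses a zero vector, "mean" uses `reference_mean`.
    """
    if not frames:
        return []
    dim = len(frames[0])
    if strategy == "zero":
        placeholder = [0.0] * dim
    elif strategy == "mean":
        if reference_mean is None or len(reference_mean) != dim:
            raise ValueError("a mean vector of matching size is required")
        placeholder = list(reference_mean)
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")
    # Return an independent copy of the placeholder for each frame.
    return [list(placeholder) for _ in frames]


# Toy data: 2-dimensional (hypothetical) audio features.
train_audio = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
test_audio = [[0.5, 0.5], [1.5, 2.5]]

zeroed = disable_modality(test_audio, "zero")
averaged = disable_modality(test_audio, "mean", feature_mean(train_audio))
print(zeroed, averaged)
```

The placeholder features are then fed to the regressors of the disabled modality in place of the real ones, so the fusion stage still receives one prediction per regressor.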

IV-E Discussion

The proposed approaches perform better in arousal than in valence, especially when the audio features are available. This may be because arousal, which relates to emotional intensity or activation level, is more distinctly captured in tone of voice, volume, and speech rate, even without visual cues. For example, screams or high intonations may indicate a more excited emotional state. Elements like rhythm and timbre in speech also reflect emotional excitement; rapid changes in rhythm or variations in timbre can indicate more intense emotional states. Audio data carry significant information about the emotional state and can be quite effective in capturing the subtleties of arousal levels.

Auditory cues might be less effective in conveying valence levels. Valence, associated with the positivity or negativity of emotions, is often reflected in facial expressions and might be more nuanced and complex to discern from audio alone. Visual cues are critical in identifying the valence levels, making video a more informative modality for this dimension.

Multimodal emotion recognition using dynamic ensemble selection appears to be an effective strategy for combining these modalities, showing good results under ideal conditions. Furthermore, these techniques exhibit notable robustness when confronted with the absence of specific modalities.

V Conclusion

Our investigation into the representation of time-continuous emotions, particularly in arousal and valence space, through different dynamic selection approaches has yielded valuable insights. Even in less-than-ideal conditions, multimodal emotion recognition systems demonstrated their versatility and reliability.

The findings reveal that under ideal conditions, with both modalities available, DW and DWS show the highest performance in arousal prediction (a CCC of 0.72, on par with the mean baseline), while DS outperforms the other methods in valence prediction. Regarding RQ1 (“Could dynamic selection of modalities be a promising approach for a multimodal AI method?”), the results indicate that dynamic selection-based methods are promising.

However, when a modality is absent, a noticeable decline in performance is observed, emphasizing the importance of each modality in contributing to accurate predictions. The missing-modality experiments revealed interesting nuances that address RQ2 (“Which is the impact in terms of emotion recognition performance when one modality (video or audio) is missing?”). In most missing-modality scenarios, all dynamic selection-based methods (DS, DW, DWS, and Meta-DW) outperformed the baseline. The only exception occurred in valence when the video modality was unavailable, where Meta-DW failed to surpass the baseline.

The study concludes by emphasizing the intricate interplay between audio and video modalities in emotion prediction, showcasing the adaptability of dynamic selection methods in handling missing modalities and addressing the research questions posed.

Acknowledgment

This work was supported by CNPq (National Council for Scientific and Technological Development) grant 306688/2018-2.