Audio-Visual Approach For Multimodal Concurrent Speaker Detection

Amit Eliav
Faculty of Engineering
Bar-Ilan University
Ramat-Gan, Israel
[email protected]

Sharon Gannot
Faculty of Engineering
Bar-Ilan University
Ramat-Gan, Israel
[email protected]

Abstract

Concurrent Speaker Detection (CSD), the task of identifying the presence and overlap of active speakers in an audio signal, is crucial for many audio tasks such as meeting transcription, speaker diarization, and speech separation. This study introduces a multimodal deep learning approach that leverages both audio and visual information. The proposed model employs an early fusion strategy combining audio and visual features through cross-modal attention mechanisms, with a learnable [CLS] token capturing the relevant audio-visual relationships.

The model is extensively evaluated on two real-world datasets, AMI and the recently introduced EasyCom dataset. Experiments validate the effectiveness of the multimodal fusion strategy. Ablation studies further support the design choices and the training procedure of the model. As this is the first work reporting CSD results on the challenging EasyCom dataset, the findings demonstrate the potential of the proposed multimodal approach for CSD in real-world scenarios.

I Introduction

Our research is a part of the Socially Pertinent Robots in Gerontological Healthcare (SPRING) project. SPRING aims to create an assistant robot for public spaces such as airports, malls, hospitals, etc. It involves multiple engineering disciplines and research labs around the world. The project requires audio-related tasks such as speech detection, speech enhancement, speaker detection, and speech separation, as well as video- and Natural Language Processing (NLP)-related tasks.

CSD is the task of identifying the presence and overlap of active speakers in an audio signal. It involves classifying audio segments into three classes: 1) no speech activity (noise only), 2) only a single active speaker, and 3) multiple active speakers.

Accurate CSD is a key component in many speech-processing applications, such as audio scene analysis, meeting transcription, speaker counting and diarization, speech detection, and speech separation. It is also an important part of many “cocktail party” scenarios that involve analyzing spatial multi-microphone signals.

The CSD task remains challenging due to the inherent complexities involved in analyzing human speech. Variations in accents, pitches, and speaking styles across different individuals can make the accurate identification and detection of active speakers difficult. Additionally, real-world audio signals often contain varying environmental noise and reverberation, further contributing to the difficulty of this problem. Consequently, CSD continues to be an active area of research, with ongoing efforts aimed at developing more robust and accurate methods to handle this task.

In this study, we introduce a deep learning approach for multimodal audio-visual models aimed at addressing the CSD task. In scenarios such as those encountered in the SPRING project, both audio and video modalities are often accessible. These multimodal datasets have become far more common in recent years, leading researchers to explore audio-visual approaches for the CSD task.

Combining both modalities can enhance the model’s accuracy by providing a more comprehensive and robust representation of the environment. While audio data may be affected by surrounding acoustic noise, video data tends to be more resilient, potentially capturing speakers even in noisy environments with minimal visual interference. However, relying solely on video data for a CSD model is constrained by the camera’s field of view, potentially missing speakers outside its scope.

Therefore, we investigate both audio-only and visual-only models and compare them to a multimodal audio-visual model to highlight its advantages. We evaluate these different models using real-world databases such as AMI [1] and EasyCom [2], demonstrating improvements over existing approaches.

In [3], a multichannel CSD model is presented as one of the building blocks of a Linearly Constrained Minimum Variance (LCMV) beamformer. It serves as a controller for the different components that must be estimated by the LCMV beamformer algorithm.

In [4, 5] both attention mechanisms and Convolutional Neural Networks (CNN) are used to create speaker-counting, speech recognition, and speaker identification models.

In [6, 7], the related Overlapped Speech Detection (OSD) task is addressed with a Long Short-Term Memory (LSTM) model.

In [8], a model for the speech separation task is presented that leverages the attention mechanism, which was first used in NLP-related tasks [9, 10]. The attention mechanism is also used in [11] for an audio classification task.

Other research works address two important tasks related to CSD, both binary audio classification tasks, which are formally presented in Section II. The first is Voice Activity Detection (VAD), which classifies audio into speech or non-speech, and the second is OSD, which classifies audio into overlapped or non-overlapped speech.

In [12], a Temporal Convolutional Network (TCN)-based model is used for three tasks: VAD, OSD, and a combined VAD+OSD. In [13], a Transformer-based model is used for the same tasks.

In [14], a multichannel Transformer-based model is used for the OSD task.

Pyannote [15] is a Python library that provides various models for audio-related tasks, including speaker diarization, VAD, and OSD. This is the only publicly available package that allows us to directly extract results from datasets we use and compare them to our findings. For the remaining comparisons, we rely on the results reported in the respective papers.

The three tasks VAD, OSD, and VAD+OSD (which is equivalent to CSD) are tackled in [17] with a deep-learning model based on WavLM [16] and TCN. In [18], a multi-task model is presented for three tasks, VAD, OSD, and Speaker Change Detection (SCD), obtained by fine-tuning a ‘wav2vec 2.0’ [19] architecture.

In [20], a model is presented for combined tasks: speaker counting (up to 2 speakers), speech separation, and speech enhancement. If a single speaker is detected, the model enhances it. If two overlapping speakers are detected, the model first separates the speakers and then enhances each of them.

In our recent paper [21], we presented an audio-only CSD model for both single-channel and multichannel audio data.

Multimodal models have shown improvement over single-modality models and are widely used in many applications, such as in [22], where Light Detection and Ranging (LiDAR) and camera data are fused for a vision task. While most of the works we have referred to so far are based on audio-only datasets and models, which can be limited in capturing the full context, recent works incorporate multimodal approaches that utilize both audio and visual information for audio-related tasks.

In [23], an audio-visual model, as well as audio-only and video-only models, are presented for the OSD task.

In [24, 25], an audio-visual model is presented for the speaker localization task, using the recently published EasyCom dataset [2].

Other works such as [26, 27] present additional audio-visual models for audio-related tasks such as diarization, speech separation, dereverberation, and recognition.

In this paper, we propose an algorithm to solve the CSD task. Our main contributions are: 1) we present an audio-visual multimodal model for multichannel audio, together with a multimodal fusion scheme; 2) we provide audio and visual augmentation techniques that proved beneficial to the training procedure; 3) similar to recent papers, we evaluate the performance of the proposed model on the AMI dataset [1], and, to the best of our knowledge, we are the first to report VAD, OSD, and CSD results for the recent EasyCom dataset [2].

II Problem Formulation

While our main focus is on the CSD task, we begin by defining two related and common speaker detection tasks: Voice Activity Detection (VAD) and Overlapped Speech Detection (OSD).

Let $X_{A}\in\mathbb{R}^{N\times L}$ represent the audio data, where $N$ is the number of microphones, and $L$ is the data length in samples. Let $X_{V}\in\mathbb{R}^{T\times C\times H\times W}$ represent the visual data, where $T$ is the number of frames, $C$ is the number of channels, and $(H,W)$ is the image resolution.

VAD is a binary classification task that aims to distinguish speech from non-speech regions in an audio signal. Formally, VAD assigns each segment to one of two classes, as indicated in (1):

$$\mathrm{VAD}(X_{A},X_{V})=\begin{cases}\textrm{Class \#0}&\textrm{Non-speech activity}\\ \textrm{Class \#1}&\textrm{Speech activity}\end{cases} \tag{1}$$

Note that the detected speech-activity regions can contain either a single active speaker or multiple active speakers.

OSD is also a binary classification task; it aims to distinguish overlapped from non-overlapped speech regions in an audio signal. Formally, OSD assigns each segment to one of two classes, as indicated in (2):

$$\mathrm{OSD}(X_{A},X_{V})=\begin{cases}\textrm{Class \#0}&\textrm{Non-overlapped speech}\\ \textrm{Class \#1}&\textrm{Overlapped speech}\end{cases} \tag{2}$$

Note that the detected non-overlapped regions can contain either noise-only or single-speaker signals.

While VAD and OSD serve as fundamental building blocks, they face limitations in distinguishing between different types of signals combined within the same class. In VAD, both single-speaker and overlapping-speaker speech are grouped into one class, despite potentially exhibiting different statistical behaviors. Similarly, in OSD, noise-only and single-speaker segments are treated as a single class, although they represent distinct acoustic scenarios. By separating these cases into individual classes, CSD provides a finer-grained categorization, enabling a more comprehensive understanding and analysis of the acoustic scene.

The multimodal CSD algorithm combines both the VAD and OSD tasks into a single multi-class classification task. The CSD model classifies each video frame and its corresponding audio segment (either single-microphone or multi-microphone) into one of the three classes, as indicated in (3), for each frame $f\in\{1,\ldots,T\}$:

$$\mathrm{CSD}(X_{A},X_{V})=\begin{cases}\textrm{Class \#0}&\textrm{Noise only}\\ \textrm{Class \#1}&\textrm{Single-speaker activity}\\ \textrm{Class \#2}&\textrm{Concurrent-speaker activity}\end{cases} \tag{3}$$
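The relation between the three tasks can be illustrated with a minimal sketch. It assumes that a per-frame count of active speakers is available (for example from the transcriptions); the function names and the example counts are hypothetical and are not taken from the authors' code.

```python
def csd_label(num_active_speakers: int) -> int:
    """Map a per-frame active-speaker count to the CSD class of (3)."""
    if num_active_speakers < 0:
        raise ValueError("speaker count must be non-negative")
    # 0: noise only, 1: single speaker, 2: two or more concurrent speakers
    return min(num_active_speakers, 2)


def vad_label(csd: int) -> int:
    """Class #1 of (1): any speech activity, i.e. CSD class 1 or 2."""
    return int(csd >= 1)


def osd_label(csd: int) -> int:
    """Class #1 of (2): overlapped speech only, i.e. CSD class 2."""
    return int(csd == 2)


# Hypothetical speaker counts for the 7 frames of one clip.
counts = [0, 1, 3, 2, 1, 0, 0]
csd = [csd_label(c) for c in counts]
vad = [vad_label(c) for c in csd]
osd = [osd_label(c) for c in csd]
```

Under this mapping, VAD and OSD labels are obtained from CSD labels by merging classes, which is why CSD is the finer-grained of the three tasks.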

The distribution of statistical features within audio segments can exhibit significant variability depending on the underlying acoustic scene. For instance, class ‘0’ (‘Noise only’) may encompass various noise types, each with distinct statistical characteristics. Similarly, class ‘1’ (‘Single-speaker activity’) presents challenges due to the inherent diversity of human speech. Individual speakers possess unique accents, speaking styles, and vocal characteristics, posing obstacles to accurate identification. Furthermore, class ‘2’ (‘Concurrent-speaker activity’) introduces additional complexity due to the variable number of active speakers, leading to a wider range of statistical properties within the segments. The presence of background noise or reverberation can further complicate these challenges.

The visual domain can be useful for the CSD task since it is indifferent to acoustic noise. On the other hand, it may lack crucial information due to low visibility of the active speakers in the scene, or due to active speakers being outside the camera’s field of view.

Consequently, developing robust and accurate CSD methods is critical to handle the inherent complexity and variability of real-world scenarios. By fusing information from both audio and visual modalities, we can potentially enhance the performance and robustness of CSD models. This multimodal approach can provide complementary cues that address limitations present in individual modalities alone, leading to a more comprehensive understanding of the acoustic scene.

III Proposed Model

The proposed model is based on several building blocks, including feature-extraction backbones, audio and visual blocks, and a fusion scheme. We chose to use pre-trained audio and video models as backbone feature extractors. In addition, a fusion technique [28] must be chosen to join the audio and visual modalities. We examine both early and late fusion, as well as other blocks, such as multi-head attention (MHA), that pass information between the modalities. The proposed model uses early fusion to jointly process the information and perform the CSD classification task.

The audio backbone extracts features from the input multichannel audio data and is based on a pre-trained HuBERT model [29]. The HuBERT model is applied to each microphone signal, and the output of the last Transformer layer is used as the extracted tokens. From each audio channel, $S^{\prime}$ tokens of dimension $768$ are extracted, where $S^{\prime}$ depends on the input length. The extracted tokens from the multichannel data are concatenated along the first dimension, resulting in a $(S\times 768)$ features tensor, where $S=N\cdot S^{\prime}$ and $N$ is the number of microphones.

The visual backbone extracts the features from the visual data. The input to the visual backbone is the streams of cropped faces extracted from the original video data, as described in Section III-A. A pre-trained R3D-18 model [30] is used as the backbone feature extractor for each video stream, and all the extracted features are concatenated along the stream dimension resulting in a tensor of size $(\#Streams\times 512)$ , where 512 is the visual feature dimension.

The audio and visual blocks start the fusion of the two modalities, followed by the rest of the layers of the fusion scheme and the classification layer that obtains the final output predictions.

III-A Pre-Processing and Input data

Both the audio and visual data must be pre-processed, each with a different pipeline. The microphone signals are first resampled to 16 kHz to match the audio backbone’s sampling rate. The video data is split into 7-frame-long clips, from which a cropped video stream is extracted for each face in the scene. Face extraction is performed by a YOLOv8 model trained for face detection and set to tracking mode (the trained models are available at https://github.com/akanametov/yolov8-face; we used the ‘yolov8n-face.pt’ model). Each stream is then resized to a resolution of $(224\times 224)$. The maximum number of streams depends on the dataset and equals the maximum number of faces detected simultaneously in a 7-frame-long clip: 8 in the EasyCom dataset and 7 in the AMI dataset. If the number of detected streams in a segment is smaller than this maximum, the segment is zero-padded. For the AMI dataset, we use all 4 ‘Closeup’ cameras and concatenate all their detected streams.

The output labels are determined using the transcribed datasets, with a resolution of a single video frame: 0.05 s for the EasyCom dataset (20 Frames Per Second (fps)) and 0.04 s for the AMI dataset (25 fps). We use 7 frames of video and the matching audio data as the input to the model. Therefore, the overall dimensions of the inputs are $(N\times L)$ for the audio tensor and $(\#Streams\times 7\times 3\times 224\times 224)$ for the visual tensor, where $L=5600$ for EasyCom and $L=4480$ for AMI. The output prediction is a tensor of size $(7\times 3)$, holding the 3 class predictions for each of the 7 input video frames.
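The stated audio lengths follow from seven video frames of audio at 16 kHz. The sketch below reproduces this arithmetic; it assumes 20 fps for EasyCom and 25 fps for AMI, which is the assignment consistent with $L=5600$ and $L=4480$. The helper names are illustrative only.

```python
SAMPLE_RATE_HZ = 16_000   # audio is resampled to 16 kHz
FRAMES_PER_CLIP = 7       # video frames per input clip
IMAGE_SIZE = 224          # cropped faces are resized to 224 x 224


def audio_samples_per_clip(fps: int) -> int:
    """Audio samples L spanned by one 7-frame clip at the given video rate."""
    return FRAMES_PER_CLIP * SAMPLE_RATE_HZ // fps


def input_shapes(num_mics: int, num_streams: int, fps: int):
    """Shapes of the audio tensor, the visual tensor, and the output tensor."""
    audio_shape = (num_mics, audio_samples_per_clip(fps))
    visual_shape = (num_streams, FRAMES_PER_CLIP, 3, IMAGE_SIZE, IMAGE_SIZE)
    output_shape = (FRAMES_PER_CLIP, 3)   # 3 class scores per video frame
    return audio_shape, visual_shape, output_shape


easycom_len = audio_samples_per_clip(fps=20)   # assumed EasyCom frame rate
ami_len = audio_samples_per_clip(fps=25)       # assumed AMI frame rate
```

With these assumptions, `easycom_len` is 5600 and `ami_len` is 4480, matching the values of $L$ given above.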

III-B Data Augmentation

Most available datasets for the CSD task are highly unbalanced between the classes, as shown in Table I and as is typical of natural human conversations. This imbalance is addressed during the training process using several techniques, such as tuning the loss function, as will be discussed in Section III-E, and data augmentation. Augmenting and balancing the data are important in classification tasks to prevent the model’s results from being biased towards the majority class. Data augmentation is a useful tool in such cases, both for audio and visual data.

For the audio data, we use the following augmentation procedures: 1) pitch shift in the time domain (see pytorch.org/audio/main/generated/torchaudio.transforms.PitchShift.html), and 2) spectral masking in the frequency domain. The spectral masking operation can mask frequency bands either over the entire time frame or in patches.

For the visual data, we use data augmentation methods from PyTorch (see pytorch.org/vision/stable/transforms.html), such as ‘Random Rotation’, ‘Elastic Transform’, ‘Random Horizontal Flip’, ‘Color Jitter’, ‘Grayscale’, ‘Gaussian Blur’, and ‘Random Adjust Sharpness’. An additional data augmentation technique employed for the visual modality is random masking, i.e., setting patches of pixels to zero. Specifically, approximately 45 patches of size $10\times 10$ pixels are randomly distributed and masked across each video frame. Figure 1 shows samples of the visual data augmentations.
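The random masking step can be sketched in plain Python as follows. This is an illustration under simplifying assumptions, not the authors' implementation: the frame is a single-channel nested list, exactly 45 patches are drawn, and patches may overlap.

```python
import random


def mask_random_patches(frame, num_patches=45, patch=10, rng=None):
    """Return a copy of `frame` (H x W nested lists) with square patches zeroed."""
    if rng is None:
        rng = random.Random()
    height, width = len(frame), len(frame[0])
    masked = [row[:] for row in frame]          # leave the input untouched
    for _ in range(num_patches):
        top = rng.randrange(height - patch + 1)  # patch stays inside the frame
        left = rng.randrange(width - patch + 1)
        for r in range(top, top + patch):
            masked[r][left:left + patch] = [0] * patch
    return masked


# Hypothetical all-white 224 x 224 frame, masked with a fixed seed.
frame = [[255] * 224 for _ in range(224)]
masked = mask_random_patches(frame, rng=random.Random(0))
num_zeroed = sum(row.count(0) for row in masked)
```

Because patches may overlap, the number of zeroed pixels is at most $45\cdot 100=4500$ per frame, which is below 10% of a $224\times 224$ image.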

III-C Architecture - Backbones, Audio- and Visual-Blocks

The audio backbone is based on a pre-trained HuBERT model [29], which is used as a feature extractor for each microphone signal. It receives the preprocessed tensor of shape $(N\times L)$ and is applied to each microphone signal separately; the output of the last Transformer layer is used as the extracted tokens. From each audio channel, $S^{\prime}$ tokens of dimension $768$ are extracted. The extracted tokens from the multichannel data are concatenated along the first dimension, resulting in a $(S\times 768)$ features tensor, where $S=N\cdot S^{\prime}$ and $N$ is the number of microphones. Concatenation along the first dimension (the microphone dimension) is supported by our recent study [21], in which we compare three strategies for merging multichannel audio data for the CSD task.
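The dependence of $S^{\prime}$ on the input length can be made concrete with a small calculation. The sketch assumes the convolutional front end of the standard HuBERT-base configuration (seven unpadded 1-D convolutions with the kernel sizes and strides listed below); if the backbone used here differs, the numbers change accordingly.

```python
# Kernel sizes and strides of the 7-layer convolutional feature encoder in
# the standard HuBERT-base / wav2vec 2.0 configuration (an assumption here).
CONV_KERNELS = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDES = (5, 2, 2, 2, 2, 2, 2)


def tokens_per_channel(num_samples: int) -> int:
    """Number of tokens S' produced for one channel of `num_samples` samples."""
    length = num_samples
    for kernel, stride in zip(CONV_KERNELS, CONV_STRIDES):
        length = (length - kernel) // stride + 1   # unpadded 1-D convolution
    return length


def total_tokens(num_mics: int, num_samples: int) -> int:
    """S = N * S' after concatenating the per-microphone tokens."""
    return num_mics * tokens_per_channel(num_samples)


s_prime_5600 = tokens_per_channel(5600)
s_prime_4480 = tokens_per_channel(4480)
```

Under this assumption, an input of $L=5600$ samples gives $S^{\prime}=17$ and $L=4480$ gives $S^{\prime}=13$, so a 6-microphone input of length 5600 would yield $S=102$ audio tokens.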

The visual backbone receives the streams of cropped faces after the preprocessing described in Section III-A. The visual backbone is based on a pre-trained R3D-18 model [30] and is used as a feature extractor for each stream. For each stream, the R3D-18 model extracts a feature vector of dimension $512$, and all the extracted features are concatenated along the stream dimension, resulting in a tensor of size $(\#Streams\times 512)$.

These two initial steps of preprocessing and feature extraction from each modality are presented in Fig. 2 and demonstrated for the EasyCom dataset. The two backbones are therefore used to extract the two modalities’ feature vectors, of shapes $(S\times 768)$ and $(\#Streams\times 512)$ for the audio and visual respectively.

The audio and visual blocks, as shown in Fig. 3, share a similar architecture, consisting of normalization layers, MHA, and fully connected layers. These blocks serve both to enhance the features of their respective modalities and as part of the fusion scheme, as described in Section III-D.

III-D Architecture - Fusion and Classification

The first step of fusing the audio-visual modalities starts in the audio and visual blocks, with a normalization layer applied to each modality’s tokens. Normalization layers are employed, separately for each modality, before and after the MHA layer to ensure that the extracted tokens are on a similar scale, mitigating the potential impact of different value ranges across modalities on the subsequent layers.

The MHA is used with a cross-modality strategy, where each modality uses the other modality’s tokens as the Q input tensor. The MHA layer passes and extracts the information within each modality’s tokens as well as across the two modalities, thereby initiating the early fusion of the audio and visual data.

Each of the feature extraction backbones extracts tokens of a different dimension: 768 for the audio and 512 for the visual. A fully connected layer is employed for each modality to project the tokens into a common dimension $D$, ensuring that the tokens from different modalities are represented in a shared embedding space.

The projected tokens from the two modalities are then concatenated with a class token [CLS], an additional learnable token of the same dimension. The concatenated tokens are fed into $M$ multimodal attention blocks. Each block consists of an MHA mechanism that captures cross-modal interactions among the fused tokens of the two modalities, followed by a normalization layer that stabilizes the process. These stacked blocks allow the model to refine the cross-modal representations, enabling it to capture the relationships and dependencies across the two modalities.

The classifier takes only the token corresponding to the [CLS] token as input and outputs a tensor of size $(7\times 3)$ which is used for the prediction of the 7 input visual frames and corresponding audio. The [CLS] token mechanism should make the classification process unbiased towards any of the input tokens, as discussed in [31] when using a Transformer model, and has proven to be effective in our recent study [21]. The audio-visual modalities fusion scheme, the multimodal MHA blocks, and the classification layer are presented in Fig. 3.

The early fusion scheme combined with the [CLS] token mechanism, as described earlier, was selected as the proposed model based on the empirical results of the model training. It also builds upon our previous research [21], which incorporates the [CLS] token in an audio-only CSD model. However, we also evaluated three alternative fusion strategies and configurations during development. Specifically, we explored early fusion without the [CLS] token (Fig. 4(a)), late fusion with the [CLS] token (Fig. 4(b)), and late fusion without the [CLS] token (Fig. 4(c)).

In the late fusion variants, both with and without the [CLS] token, the overall fusion scheme and architecture are similar to the proposed early fusion model. However, a key difference lies in the configuration of the MHA layers at the beginning of the fusion process. In general, these MHA layers take three inputs: the query (Q), key (K), and value (V) tensors. In the late fusion approach, we use the same modality’s feature vector as input for all three Q, K, and V tensors within each modality branch. In the early fusion variants, on the other hand, we use a cross-modality input strategy: each modality’s MHA uses the other modality’s features as the Q input tensor. This early fusion cross-modality configuration, which is also used in [23] for an OSD model, aims to enable early integration and fusion of the audio and visual modalities, potentially allowing the model to capture cross-modal relationships and dependencies more effectively at the feature level, in the early stage of the model.
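The difference between the two Q, K, V configurations can be illustrated with a single-head scaled dot-product attention written in plain Python. This is a simplified sketch: it assumes both token sets already share the same dimension and omits the learned projections, multiple heads, and normalization layers of the actual blocks. The token values are arbitrary.

```python
import math


def matmul(a, b):
    """Product of two matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    peak = max(row)                       # subtract the max for stability
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Single-head attention; q is (Tq x d), k and v are (Tk x d)."""
    d = len(q[0])
    k_transposed = [list(col) for col in zip(*k)]
    scores = matmul(q, k_transposed)                        # (Tq x Tk)
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, v)                               # (Tq x d)


audio = [[0.1 * (i + j) for j in range(4)] for i in range(5)]    # 5 audio tokens
visual = [[0.2 * (i - j) for j in range(4)] for i in range(3)]   # 3 visual tokens

# Late-fusion style: Q, K, and V all come from the same modality.
audio_self = attention(audio, audio, audio)
# Early-fusion style: the audio branch takes its queries from the visual tokens.
audio_cross = attention(visual, audio, audio)
```

In this sketch the output has one row per query token, so the cross-modal variant of the audio branch returns as many tokens as there are visual tokens, each a weighted average of the audio value vectors.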

Excluding the [CLS] token from the fusion scheme led to the classifier receiving a very large feature vector, which resulted in an excessive number of parameters in the fully connected classification layer, making it a less desirable choice. As for the late fusion strategies, their performance was less compelling than that of the early fusion approach. These reasons led us to choose the early fusion scheme with the [CLS] token mechanism as the proposed model. The in-depth analysis is presented in Section IV-C and Table II.

III-E Objective Functions

As the model is used for the CSD task, a classification task, the common choice for the loss function is the Cross-Entropy (CE) loss. To address the model’s imbalanced classification results across the 3 classes, class weights (see https://towardsdatascience.com/class-weights-for-categorical-loss-1a4c79818c2d) are incorporated into the loss calculation, assigning higher weights to under-classified classes. Additionally, Label-Smoothing (LS) [32] is applied to the ground-truth labels, which introduces a small degree of noise and prevents the model from making overconfident predictions. LS has been shown to improve generalization performance and mitigate overfitting.

By combining CE loss with class weighting and LS, the training objective aims to optimize the model’s ability to accurately classify the data across both modalities while accounting for samples that are less accurately classified and promoting better generalization.
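One common way to combine these three ingredients for a single frame is sketched below. The exact formulation, the smoothing factor, and the class weights used for training are not specified in this section, so the values and the way the weights enter the smoothed target are assumptions made for illustration.

```python
import math


def log_softmax(logits):
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return [z - log_sum for z in logits]


def weighted_smoothed_ce(logits, target, class_weights, smoothing=0.1):
    """Class-weighted cross-entropy against a label-smoothed target (one frame)."""
    num_classes = len(logits)
    log_probs = log_softmax(logits)
    loss = 0.0
    for c in range(num_classes):
        # Smoothed target: (1 - eps) on the true class, eps / C spread uniformly.
        q_c = smoothing / num_classes + (1.0 - smoothing) * (c == target)
        loss -= class_weights[c] * q_c * log_probs[c]
    return loss


weights = [1.0, 0.5, 2.0]                     # hypothetical class weights
confident_right = weighted_smoothed_ce([0.1, 0.2, 4.0], 2, weights)
confident_wrong = weighted_smoothed_ce([4.0, 0.2, 0.1], 2, weights)
```

With `smoothing=0` and unit weights the function reduces to the ordinary CE loss, and a larger weight on a class increases the penalty for misclassifying frames of that class.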

Besides the combination of CE loss, class weighting, and LS, which we consider as the baseline loss formulation, we explored alternative loss functions and regularizations to train our model and address the class imbalance issue. Specifically, we explored two additional losses as regularizers for the baseline loss: Cost-Sensitive (CS) [33] loss and Focal-loss [34]. The incorporation of the CS loss made the training process less stable. Additionally, the Focal-loss did not exhibit a clear impact on the model’s performance, failing to provide substantial improvements over the baseline loss formulation. As a result, we opted for the combination of CE loss, class weighting, and LS, which proved to be the most effective approach for optimizing the audio-visual CSD model.

IV Experimental study

IV-A Datasets

We evaluated the performance of our model using two real-world datasets, the EasyCom dataset [2], and the AMI dataset [1]. Both datasets use a microphone array, EasyCom with 6 microphones, and AMI with 8, but differ in the available cameras.

The AMI [1] dataset holds 100 hours of meeting recordings of English speakers (both female and male). The participants were recorded in 3 different room environments with different acoustic setups. The AMI dataset is recorded with an 8-microphone array and several different cameras, including a closeup camera for each participant, a corner camera, and an overview camera. All sessions have 4 closeup cameras, which are used in this work, as described in Section III-A.

The EasyCom dataset [2], a relatively new dataset, is recorded using Meta’s Augmented-Reality (AR) glasses set. The set has a 6-microphone array and a wide-angle single camera. The dataset was collected in a noisy simulated restaurant environment, with multiple English speakers engaging in conversations during several tasks. The EasyCom dataset presents two key challenges stemming from using the AR glasses worn by one participant during the meetings. Firstly, the audio amplitude of the wearer’s speech is considerably higher compared to other active participants due to the proximity of the microphone array. Secondly, rapid head movements by the wearer result in rapid changes in the visual data, causing shifts in the perceived locations of the speakers relative to the glasses’ viewpoint, which also alters the acoustic characteristics of the speakers’ voices. These simultaneous movements of both the speakers and the recording device contribute to the complexity of this multimodal dataset. Since the EasyCom dataset is limited, with only about 6 hours of data, and is highly unbalanced, we used multiple instances of the training set with different augmentations as described in Section III-B, and split the dataset into segments (7-frame-long clips) with a large overlap of 6 frames.

Both datasets exhibit a significant class imbalance towards classes 1 and 0. This imbalance arises from the natural dynamics of human conversation, where participants tend to take turns speaking, with minimal overlapping speech from multiple individuals. This imbalance must be addressed during the model training, which is done by three methods. First, data augmentation, as described in Section III-B. Second, balancing the datasets, where we filtered the training set to make it more balanced between the classes: we included all the segments with ‘class 2’ frames and then added more segments to reach an overall more balanced training set. Third, tuning the loss function, as described in Section III-E. The distribution of the different classes is depicted in Table I, for both the original datasets and the datasets after balancing and augmentation.

TABLE I: Class frequency [%] in the training set for all datasets. The number of frames is given in millions [M].
Dataset* denotes a balanced dataset, and Dataset† a balanced and augmented dataset.

Dataset/Class	#0 [%]	#1 [%]	#2 [%]	#Frames [M]
AMI	16.8	71.8	11.4	7.1
AMI*	40.3	29.4	30.3	2.6
AMI†	40.3	29.4	30.3	7.8
EasyCom	30.5	58.2	11.3	0.255
EasyCom†	22	39	39	1.2

IV-B Algorithm Setup

We used the architecture described in Section III and shown in Fig. 2 and Fig. 3, with the early fusion scheme and the [CLS] token mechanism. The fusion dimension is set to $D=512$ and the number of multimodal attention blocks to $M=4$.

To account for the varying number of detected video streams per data sample, we padded all samples to a fixed number of streams (as described in Section III-A). Moreover, to handle the arbitrary order of the detected faces, we randomly shuffled the order of the streams within each sample during training. This ensures that the model does not overfit to the order of the detected streams or to the position of the zero-padded streams.
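The padding and shuffling steps can be sketched as follows. The sketch assumes that the zero-padded streams are shuffled together with the detected ones; the stream contents are placeholders, and the helper name is hypothetical.

```python
import random


def pad_and_shuffle(streams, max_streams, empty_stream, rng=None):
    """Pad a list of face streams to `max_streams`, then shuffle their order."""
    if len(streams) > max_streams:
        raise ValueError("more streams than the dataset maximum")
    if rng is None:
        rng = random.Random()
    padded = list(streams) + [empty_stream] * (max_streams - len(streams))
    rng.shuffle(padded)       # applied during training only
    return padded


# Three detected faces in a clip, padded to a maximum of 8 streams.
detected = ["face_a", "face_b", "face_c"]       # placeholders for video tensors
sample = pad_and_shuffle(detected, max_streams=8, empty_stream="zeros",
                         rng=random.Random(0))
```

Every training sample then has the same number of streams, while the position of any given face, and of the padding, varies from one draw to the next.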

In training the model, we used the Adam optimizer with different learning rates for the different layers of the model, a weight decay of $10^{-9}$, and a batch size of 64. The learning rate was set to $10^{-7}$ for the audio backbone, $10^{-6}$ for the visual backbone, and $10^{-4}$ for the rest of the layers (the audio and visual blocks, the fusion scheme, and the classification layer). This differential learning rate assignment allows for fine-tuning the large pre-trained backbones at a slower pace, preventing drastic changes to the learned representations, while enabling the fusion and classification components to adapt more rapidly to the target CSD task.
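The per-layer learning rates can be organized as optimizer parameter groups. The sketch below builds such groups in plain Python from (name, parameter) pairs; the module names (`audio_backbone`, `visual_backbone`, and so on) and the placeholder parameters are hypothetical.

```python
LEARNING_RATES = {
    "audio_backbone": 1e-7,
    "visual_backbone": 1e-6,
}
DEFAULT_LR = 1e-4        # audio/visual blocks, fusion scheme, classifier
WEIGHT_DECAY = 1e-9


def build_param_groups(named_params):
    """Group (name, parameter) pairs by name prefix, one learning rate per group."""
    groups = {}
    for name, param in named_params:
        prefix = name.split(".", 1)[0]
        lr = LEARNING_RATES.get(prefix, DEFAULT_LR)
        groups.setdefault(lr, []).append(param)
    return [{"params": params, "lr": lr, "weight_decay": WEIGHT_DECAY}
            for lr, params in groups.items()]


# Placeholder parameters standing in for real model tensors.
named_params = [
    ("audio_backbone.encoder.weight", "p0"),
    ("visual_backbone.stem.weight", "p1"),
    ("fusion.cls_token", "p2"),
    ("classifier.weight", "p3"),
]
param_groups = build_param_groups(named_params)
```

A list of dictionaries with a `params` entry and per-group options such as `lr` is the format that PyTorch optimizers, including `torch.optim.Adam`, accept in place of a flat parameter list.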

Initially, we attempted to freeze both the audio and visual backbones and train only the remaining layers. However, this resulted in poor overall performance (presented in Fig. VII), potentially because the backbones were not trained specifically for the CSD task, leading to suboptimal feature representations for the fusion and classification stages.

To prevent overfitting, given the model’s substantial number of parameters: audio backbone 94M (million), visual backbone 33M, and the rest of layers 8M, which sums up to a total of about 135M parameters, the training process was limited to a modest number of epochs, typically between 3 and 5 epochs, with the specific value depending on the dataset under consideration.

IV-C Results

The performance of classification models is typically evaluated using several common metrics, including Accuracy, Precision, Recall, F1-score, and mean Average Precision (mAP). Additionally, the confusion matrix provides a detailed comparison between the ground-truth labels and the model’s predicted labels, normalized as percentages with respect to the ground-truth labels. In this study, we apply our model and training scheme and evaluate the performance using two real-world datasets, AMI [1] and EasyCom [2], using the earlier-mentioned metrics. These metrics allow for a comprehensive assessment of our model’s performance and enable comparisons with other methods, provided that the same metrics are reported.
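The row-normalized confusion matrix described above can be computed as in the following sketch. The labels in the example are made up for illustration and are not results from the paper.

```python
def confusion_matrix(y_true, y_pred, num_classes=3):
    """Counts with rows indexed by the true class and columns by the prediction."""
    counts = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        counts[t][p] += 1
    return counts


def normalize_rows(counts):
    """Express each row (ground-truth class) as percentages of that row."""
    normalized = []
    for row in counts:
        total = sum(row)
        normalized.append([100.0 * c / total if total else 0.0 for c in row])
    return normalized


def accuracy(counts):
    correct = sum(counts[i][i] for i in range(len(counts)))
    return correct / sum(sum(row) for row in counts)


# Made-up labels for eight frames.
y_true = [0, 0, 1, 1, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 1, 2, 2, 1]
counts = confusion_matrix(y_true, y_pred)
percent = normalize_rows(counts)
acc = accuracy(counts)
```

Each row of `percent` sums to 100, so its diagonal entry is the per-class recall, which makes the matrix easy to read for unbalanced classes.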

While the proposed model takes 7 video frames and their corresponding audio as input and outputs predictions for each of the 7 frames, we observed that the performance metrics were highest for the middle frame (the 4th frame). Therefore, in this work, we report the results only for the middle frame, as it represents the best performance of the model. The other 6 frames serve as context information, helping the model to better classify the middle frame. During inference, the model will still process 7 frames as input, but only the output prediction for the middle frame should be considered. The input window will then slide by 1 frame to obtain the prediction for the next middle frame.
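The sliding-window inference procedure can be sketched as follows. The callable `predict_clip` is a hypothetical stand-in for the trained model: it receives the start and end frame indices of a clip and returns one class label per frame.

```python
def sliding_middle_predictions(num_frames, predict_clip, window=7):
    """Slide a `window`-frame clip by one frame and keep the middle prediction."""
    middle = window // 2                  # index 3, i.e. the 4th of 7 frames
    results = {}
    for start in range(num_frames - window + 1):
        clip_pred = predict_clip(start, start + window)   # `window` labels
        results[start + middle] = clip_pred[middle]
    return results


def dummy_predict_clip(start, stop):
    """Toy predictor used only to exercise the loop: label = frame index mod 3."""
    return [frame % 3 for frame in range(start, stop)]


predictions = sliding_middle_predictions(10, dummy_predict_clip)
```

In this sketch the first and last three frames of a recording never fall in the middle of a window, so they receive no prediction unless the sequence is padded, a detail this section does not specify.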

The results of the different versions of our model are presented in Table II. We compare multiple settings: the early and late fusion schemes, and the integration of the [CLS] token. This comparative analysis aims to provide insights into the influence of the fusion strategy and the contribution of the [CLS] token to audio-visual CSD. In addition, we compare the audio-visual variants with two of our audio-only models and a visual-only variant. The first audio-only model is from our recent work [21], re-trained for the new EasyCom dataset. The second audio-only variant is based on the currently proposed model, using its architecture without the visual branch. Similarly, the visual-only variant is based on the proposed model without the audio branch.

TABLE II: A comparison of the proposed audio-visual model across four configurations, evaluating the performance on the VAD, OSD, and CSD tasks, including Accuracy (A), Precision (P), Recall (R), F1-score (F1), and mAP (%) measures on the EasyCom dataset. Bold: best overall, underlined: best within modality.

		VAD					OSD					CSD
Modalities	Method	A	P	R	F1	mAP	A	P	R	F1	mAP	A	P	R	F1	mAP
Audio	[21]	74.1	73.5	74.1	72.5	87.5	81.6	85.9	81.6	83.5	25.0	59.5	62.9	59.5	60.2	66.3
Audio	Audio-Block	76.8	77.2	76.8	77.0	89.1	82.5	85.5	82.5	83.9	25.0	59.8	64.9	59.8	61.0	66.9
Visual	Visual-Block	64.7	66.1	64.7	65.2	79.7	83.9	84.7	83.9	84.3	19.3	53.1	54.4	53.1	53.5	55.9
Audio-Visual	Early, w/o [CLS]	74.8	75.4	74.8	75.0	88.0	87.7	86.1	87.7	86.8	27.6	64.1	64.3	64.1	64.0	68.5
	Early, with [CLS]	79.0	81.2	79.0	79.4	92.8	90.0	87.0	90.0	86.6	32.8	70.4	69.6	70.4	67.9	71.7
	Late, w/o [CLS]	41.1	52.3	41.1	38.6	63.5	89.8	85.8	89.8	85.1	10.8	35.1	52.9	35.1	18.4	40.9
	Late, with [CLS]	77.5	78.4	77.5	77.7	90.4	82.6	87.4	82.6	84.4	31.3	61.5	67.7	61.5	62.5	71.0

The confusion matrices for both datasets are presented in Table III. We report the results for the best audio-visual model variant, which uses early fusion and the [CLS] token.

TABLE III: CSD results: confusion matrices, as [%] normalized to the ground-truth labels. ‘T’-true labels, ‘P’-predicted labels.

	AMI			EasyCom
T \P	0	1	2	0	1	2
0	89	8	3	81	15	4
1	14	73	13	26	60	14
2	3	38	59	16	42	42

A comparison in terms of Accuracy, Precision, Recall, F1-score, and mAP between our best model variant and the available methods is presented in Table IV and Table V for the AMI and EasyCom datasets, respectively. For the AMI dataset, a direct comparison with other state-of-the-art methods is possible, as several previous works have reported results on most of the mentioned metrics. However, most of these works reported their results for the OSD task; we therefore converted our multi-class CSD classification results into a binary OSD classification by aggregating the probabilities of classes #0 and #1. Similarly, VAD classification results can be obtained by aggregating the probabilities of classes #1 and #2.
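The conversion from three-class CSD probabilities to the binary OSD and VAD tasks can be sketched as below. The function name is hypothetical, and choosing the more probable of the two aggregated classes as the final decision is an assumption made for illustration; the decision rule is not specified in the text.

```python
def csd_to_binary(probs):
    """Aggregate CSD class probabilities (p0, p1, p2) into OSD and VAD pairs.

    Classes: 0 = noise only, 1 = single speaker, 2 = multiple speakers.
    Each returned pair is (P(negative), P(positive)).
    """
    p0, p1, p2 = probs
    osd = (p0 + p1, p2)  # no overlap vs. overlap
    vad = (p0, p1 + p2)  # no speech vs. speech
    return osd, vad


def decide(pair):
    # Assumption: pick the more probable of the two aggregated classes.
    return int(pair[1] > pair[0])


osd, vad = csd_to_binary((0.25, 0.5, 0.25))
print(osd, decide(osd))  # overlap decision
print(vad, decide(vad))  # speech decision
```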

The EasyCom dataset is relatively new, and to the best of our knowledge, no previous works have reported results on it for the tasks addressed in this study. However, [15] offers models for the two related tasks of VAD and OSD (available at https://huggingface.co/pyannote). We used both models to extract the classification results for both tasks and combined the outputs to synthetically generate the results for the CSD task. This allows us to compare our results for the EasyCom dataset across all three important tasks: VAD, OSD, and CSD. In addition, we trained our previously proposed model from [21] on the EasyCom dataset and compared its performance to the models proposed in this paper. For a deeper analysis of our performance, we include a confusion matrix comparison between our best audio-visual model and the classification results obtained by [15], presented in Table VI. Both Table V and Table VI show the difficulty of the EasyCom dataset for the discussed tasks, with lower performance than on the AMI dataset. However, our audio-visual model handles this dataset best, with higher values in most of the measured metrics. The confusion matrix shows that the classification performance of [15] is highly biased toward class #1, while our performance is more balanced across the three classes.
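One plausible way to combine per-frame binary VAD and OSD decisions into three-class CSD labels is sketched below. The exact combination rule is not detailed in the text, so the rule used here is an assumption: overlap takes precedence (overlap implies speech), otherwise detected speech maps to class #1, and everything else to class #0. The function name and toy inputs are hypothetical.

```python
def combine_vad_osd(vad, osd):
    """Map per-frame binary VAD/OSD decisions to CSD labels 0, 1, or 2."""
    labels = []
    for speech, overlap in zip(vad, osd):
        if overlap:
            labels.append(2)  # assumed: overlap wins even if VAD disagrees
        elif speech:
            labels.append(1)  # speech without overlap: single speaker
        else:
            labels.append(0)  # neither detector fired: noise only
    return labels


# Toy decisions (hypothetical): 1 = detected, 0 = not detected.
vad = [0, 1, 1, 0]
osd = [0, 0, 1, 1]
csd = combine_vad_osd(vad, osd)
print(csd)
```

The last frame shows the ambiguous case in which the two detectors disagree; a different tie-breaking rule would change the synthesized CSD labels for such frames.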

TABLE IV: A comparison between the proposed model and various competing methods in evaluating the performance on the OSD task, including Accuracy (A), Precision (P), Recall (R), F1-score (F1) and mAP in (%) measures on the AMI dataset. Bold: best overall, underlined: best within modality.

Modalities	Method	A	P	R	F1	mAP
Audio	[14]	N/A	87.8	87	N/A	N/A
	[13]	N/A	87.8	87	N/A	60.3
	[21]	N/A	92.4	89	N/A	73.1
	[15]	N/A	80.7	70.5	75.3	N/A
	[23] (Single-Channel)	N/A	N/A	N/A	N/A	62.7
	[17] (close-talk mic)	N/A	N/A	N/A	80.4	N/A
	[18]	94.16	79.04	79.38	79.21	N/A
	Our Audio-Block	89.6	89.6	89.6	89.6	63
Visual	[23]	N/A	N/A	N/A	N/A	20
Visual	Our Visual-Block	80.9	87.6	80.9	83.2	51.6
Audio-Visual	[23]	N/A	N/A	N/A	N/A	67.2
Audio-Visual	Our Audio-Visual	85.4	87.5	85.4	86.3	53.1

TABLE V: A comparison between the proposed model and two available methods in evaluating the performance on the VAD, OSD, and CSD tasks, including Accuracy (A), Precision (P), Recall (R), F1-score (F1) and mAP in (%) measures on the EasyCom dataset.

	VAD					OSD					CSD
Method	A	P	R	F1	mAP	A	P	R	F1	mAP	A	P	R	F1	mAP
[21]	74.1	73.5	74.1	72.5	87.5	81.6	85.9	81.6	83.5	25	59.5	62.9	59.5	60.2	66.3
Using [15]	77.0	76.8	77.0	75.6	N/A	88.8	86.1	88.8	87.0	N/A	66.9	66.8	66.9	64.8	N/A
Our Audio-Block	76.8	77.2	76.8	77.0	89.1	82.5	85.5	82.5	83.9	25.0	59.8	64.9	59.8	61.0	66.9
Our Audio-Visual	79.0	81.2	79.0	79.4	92.8	90.0	87.0	90.0	86.6	32.8	70.4	69.6	70.4	67.9	71.7

TABLE VI: EasyCom CSD comparison: confusion matrix comparison between the available method [15] and our audio-visual (AV) model, as [%] normalized to the ground-truth labels. ‘T’-true labels, ‘P’-predicted labels.

	Our AV model			[15]
T \P	0	1	2	0	1	2
0	81	15	4	50	48	2
1	26	60	14	10	87	3
2	16	42	42	3	78	19

IV-D Ablation Study

We conducted an ablation study to analyze the impact of two key components on the performance of the proposed model: one related to the training process and the other related to the model architecture itself. For the training process, various data augmentation techniques were applied to the training data, as discussed in Section III-B. The model was trained both with and without these augmentation techniques to assess their influence on the classification performance. Regarding the model architecture, two scenarios were considered: training the weights of the backbone feature extraction models and freezing them. In the former case, the pre-trained backbone models were allowed to update their weights during training, using a different learning rate than the rest of the layers, as discussed in Section IV-B. In the latter case, only the remaining layers of the model were trained, keeping the weights of the backbone models fixed. Table VII presents the four resulting configurations, evaluated on the EasyCom dataset.

TABLE VII: Ablation study: A comparison of the proposed audio-visual model with and without data augmentations and backbone training, evaluating the performance on the VAD, OSD, and CSD tasks, including Accuracy (A), Precision (P), Recall (R), F1-score (F1), and mAP (%) measures on the EasyCom dataset.

		VAD					OSD					CSD
Data augmentations	Backbone training	A	P	R	F1	mAP	A	P	R	F1	mAP	A	P	R	F1	mAP
✗	✗	64.9	42.1	64.9	51.1	71.5	88.1	81.1	88.1	84.6	22.2	59.0	60.0	59.0	60.0	68.5
✓	✗	77.9	79.2	77.9	78.3	91.7	86.5	86.9	86.5	86.7	32.3	65.6	68.0	65.6	65.8	71.5
✗	✓	77.5	79.3	77.5	77.9	91.2	83.5	86.5	83.5	84.8	29.2	64.1	67.2	64.1	64.6	71.7
✓	✓	79.0	81.2	79.0	79.4	92.8	90.0	87.0	90.0	86.6	32.8	70.4	69.6	70.4	67.9	71.7

The table shows that combining data augmentation with training of both the audio and visual backbones yields the best overall classification performance.

V Conclusions

In this study, we presented a comprehensive deep learning approach to the Concurrent Speaker Detection (CSD) task by leveraging multimodal audio-visual models. Our research contributes to the Socially Pertinent Robots in Gerontological Healthcare (SPRING) project, with the primary aim of enhancing the robustness and accuracy of CSD in complex, real-world environments such as public spaces and interactive meeting settings.

Our proposed models were evaluated on two real-world datasets, AMI and EasyCom, covering various aspects of audio-visual scenarios. We utilized the YOLO model for video preprocessing to extract face streams, enabling more accurate visual feature extraction. Additionally, we employed state-of-the-art audio and video backbone architectures to ensure effective feature representation from both modalities. The model integrates these features through an early fusion strategy, combining audio and visual features through cross-modal attention mechanisms and subsequently refining the joint representations through stacked multimodal attention blocks. By incorporating the [CLS] token, the model effectively captures the audio-visual relationships relevant to the CSD task.

The results demonstrated that our multimodal approach achieved competitive performance on the AMI dataset. Notably, on the more challenging EasyCom dataset, our model outperformed the compared audio-only methods on most of the VAD, OSD, and CSD metrics (Table V).

The ablation studies confirmed the importance of both the data augmentation techniques and the use of differential learning rates for the audio and visual backbones compared to the remaining layers. The combination of these strategies notably enhanced the model’s performance, providing valuable insights into the training process and model architecture optimizations.

Our findings highlight the potential of multimodal audio-visual integration in the CSD task. Future work could explore further enhancements through even more sophisticated data augmentation techniques and alternative fusion strategies. These advancements could lead to even more robust models capable of handling a wider range of real-world scenarios, ultimately contributing to more effective and accurate CSD models.

References

[1] J. Carletta, S. Ashby, S. Bourban, M. Flynn, M. Guillemot, T. Hain, J. Kadlec, V. Karaiskos, W. Kraaij, M. Kronenthal, G. Lathoud, M. Lincoln, A. Lisowska, I. McCowan, W. Post, D. Reidsma, and P. Wellner, Machine Learning for Multimodal Interaction. Berlin, Heidelberg: Springer Berlin Heidelberg, 2006, ch. The AMI Meeting Corpus: A Pre-announcement, pp. 28–39.
[2] J. Donley, V. Tourbabin, J.-S. Lee, M. Broyles, H. Jiang, J. Shen, M. Pantic, V. K. Ithapu, and R. Mehra, “Easycom: An augmented reality dataset to support algorithms for easy communication in noisy environments,” arXiv preprint arXiv:2107.04174, 2021.
[3] S. E. Chazan, J. Goldberger, and S. Gannot, “LCMV beamformer with DNN-based multichannel concurrent speakers detector,” in 26th European Signal Processing Conference (EUSIPCO), 2018, pp. 1562–1566.
[4] M. Yousefi and J. H. Hansen, “Real-time speaker counting in a cocktail party scenario using attention-guided convolutional neural network,” arXiv preprint arXiv:2111.00316, 2021.
[5] N. Kanda, Y. Gaur, X. Wang, Z. Meng, Z. Chen, T. Zhou, and T. Yoshioka, “Joint speaker counting, speech recognition, and speaker identification for overlapped speech of any number of speakers,” arXiv preprint arXiv:2006.10930, 2020.
[6] N. Sajjan, S. Ganesh, N. Sharma, S. Ganapathy, and N. Ryant, “Leveraging lstm models for overlap detection in multi-party meetings,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 5249–5253.
[7] L. Bullock, H. Bredin, and L. P. Garcia-Perera, “Overlap-aware diarization: Resegmentation using neural end-to-end overlapped speech detection,” in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 7114–7118.
[8] C. Subakan, M. Ravanelli, S. Cornell, M. Bronzi, and J. Zhong, “Attention is all you need in speech separation,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021, pp. 21–25.
[9] A. Gillioz, J. Casas, E. Mugellini, and O. A. Khaled, “Overview of the transformer-based models for NLP tasks,” in 15th Conference on Computer Science and Information Systems (FedCSIS), 2020, pp. 179–183.
[10] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems (NeurIPS), vol. 30, 2017.
[11] Y. Gong, Y.-A. Chung, and J. Glass, “AST: Audio Spectrogram Transformer,” in Proc. Interspeech, 2021, pp. 571–575.
[12] S. Cornell, M. Omologo, S. Squartini, and E. Vincent, “Detecting and counting overlapping speakers in distant speech scenarios,” in Proc. Interspeech, Shanghai, China, Oct. 2020.
[13] ——, “Overlapped speech detection and speaker counting using distant microphone arrays,” Computer Speech & Language, vol. 72, p. 101306, 2022.
[14] S. Zheng, S. Zhang, W. Huang, Q. Chen, H. Suo, M. Lei, J. Feng, and Z. Yan, “Beamtransformer: Microphone array-based overlapping speech detection,” arXiv preprint arXiv:2109.04049, 2021.
[15] H. Bredin and A. Laurent, “End-to-end speaker segmentation for overlap-aware resegmentation,” arXiv preprint arXiv:2104.04045, 2021.
[16] S. Chen, C. Wang, Z. Chen, Y. Wu, S. Liu, Z. Chen, J. Li, N. Kanda, T. Yoshioka, X. Xiao, J. Wu, L. Zhou, S. Ren, Y. Qian, Y. Qian, J. Wu, M. Zeng, X. Yu, and F. Wei, “Wavlm: Large-scale self-supervised pre-training for full stack speech processing,” IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 6, pp. 1505–1518, 2022.
[17] M. Lebourdais, T. Mariotte, M. Tahon, A. Larcher, A. Laurent, S. Montresor, S. Meignier, and J.-H. Thomas, “Joint speech and overlap detection: a benchmark over multiple audio setup and speech domains,” arXiv preprint arXiv:2307.13012, 2023.
[18] M. Kunešová and Z. Zajíc, “Multitask detection of speaker changes, overlapping speech and voice activity using wav2vec 2.0,” in ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023, pp. 1–5.
[19] A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,” in Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, Eds., vol. 33. Curran Associates, Inc., 2020, pp. 12 449–12 460. [Online]. Available: https://proceedings.neurips.cc/paper_files/paper/2020/file/92d1e1eb1cd6f9fba3227870bb6d7f07-Paper.pdf
[20] Z.-Q. Wang and D. Wang, “Count and separate: Incorporating speaker counting for continuous speaker separation,” in ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021, pp. 11–15.
[21] A. Eliav and S. Gannot, “Concurrent speaker detection: A multi-microphone transformer-based approach,” arXiv preprint arXiv:2403.06856, 2024.
[22] S. Cheng, Z. Ning, J. Hu, J. Liu, W. Yang, L. Wang, H. Yu, and W. Liu, “G-fusion: Lidar and camera feature fusion on the ground voxel space,” IEEE Access, vol. 12, pp. 4127–4138, 2024.
[23] M. Kyoung, H. Jeon, and K. Park, “Audio-visual overlapped speech detection for spontaneous distant speech,” IEEE Access, vol. 11, pp. 27 426–27 432, 2023.
[24] D. A. Mitchell and B. Rafaely, “Study of speaker localization under dynamic and reverberant environments,” arXiv preprint arXiv:2311.16927, 2023.
[25] C. Murdock, I. Ananthabhotla, H. Lu, and V. K. Ithapu, “Self-motion as supervision for egocentric audiovisual localization,” in ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024, pp. 7835–7839.
[26] G. Li, J. Deng, M. Geng, Z. Jin, T. Wang, S. Hu, M. Cui, H. Meng, and X. Liu, “Audio-visual end-to-end multi-channel speech separation, dereverberation and recognition,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 2707–2723, 2023.
[27] Z. Wang, S. Wu, H. Chen, M.-K. He, J. Du, C.-H. Lee, J. Chen, S. Watanabe, S. Siniscalchi, O. Scharenborg, D. Liu, B. Yin, J. Pan, J. Gao, and C. Liu, “The multimodal information based speech processing (misp) 2022 challenge: Audio-visual diarization and recognition,” in ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023, pp. 1–5.
[28] S. Poria, E. Cambria, R. Bajpai, and A. Hussain, “A review of affective computing: From unimodal analysis to multimodal fusion,” Information Fusion, vol. 37, pp. 98–125, 2017. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S1566253517300738
[29] W.-N. Hsu, B. Bolte, Y.-H. H. Tsai, K. Lakhotia, R. Salakhutdinov, and A. Mohamed, “Hubert: Self-supervised speech representation learning by masked prediction of hidden units,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 3451–3460, 2021.
[30] D. Tran, H. Wang, L. Torresani, J. Ray, Y. LeCun, and M. Paluri, “A closer look at spatiotemporal convolutions for action recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
[31] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” in International Conference on Learning Representations (ICLR), 2021.
[32] R. Müller, S. Kornblith, and G. E. Hinton, “When does label smoothing help?” Advances in neural information processing systems (NeurIPS), vol. 32, 2019.
[33] A. Galdran, J. Dolz, H. Chakor, H. Lombaert, and I. Ben Ayed, “Cost-sensitive regularization for diabetic retinopathy grading from eye fundus images,” in Medical Image Computing and Computer Assisted Intervention (MICCAI), 2020, pp. 665–674.
[34] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollar, “Focal loss for dense object detection,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), Oct 2017.
