Text-Queried Target Sound Event Localization
Abstract
Sound event localization and detection (SELD) aims to determine the appearance of sound classes, together with their Direction of Arrival (DOA). However, current SELD systems can only predict the activities of specific classes, for example, 13 classes in DCASE challenges. In this paper, we propose text-queried target sound event localization (SEL), a new paradigm that allows the user to input text describing a sound event, after which the SEL model predicts the location of the related sound event. The proposed task presents a more user-friendly way for human-computer interaction. We provide a benchmark study for the proposed task and perform experiments on datasets created by simulated room impulse response (RIR) and real RIR to validate the effectiveness of the proposed methods. We hope that our benchmark will inspire interest and further research in text-queried sound source localization.
Index Terms:
sound event localization and detection, multimodal fusion
I Introduction
Sound event localization and detection (SELD) has attracted increasing interest recently owing to the launch of Detection and Classification of Acoustic Scenes and Events (DCASE) Task 3, whose objectives are to predict the temporal and spatial activities of sound events jointly. In previous editions, only audio modality (e.g. multi-channel audio in FOA and MIC format) is used. In DCASE 2023, videos are added to provide additional information and complementary modality. However, current task settings are constrained to limited class prediction. The number of classes in DCASE 2023 and L3DAS 2023 is 13, covering people speaking, laughter and music, etc, which follow the AudioSet [1] ontology.
Text-prompt based tasks have become prevalent recently as the model works according to the text description and is not constrained to limited classes. In the visual domain, text is used to describe the visual objects for tracking in the image plane in [2]. Imagic is proposed in [3] for text-conditioned image editing. In the audio domain, text queried audio tagging is introduced in [4]. In [5], general sound synthesis is based on the text input. In [6, 7], general source separation is conducted based on the text description and is not constrained to specific classes. Visual modality is further added in [8] to assist text-prompted source separation. In [9], text-queried separation is extended to the speech domain. Users can type to describe the speaker they want to separate, such as ‘the loudest speaker’, ‘the female speaker’, etc. In [10], text-queried target speech diarization is explored and text is used to choose the target speaker such as ‘the person who spoke at two seconds’, ‘the male speaker’ or ‘the keynote speaker’.
Inspired by the text-driven tasks, we propose text-queried target sound event localization, which takes a text description as input and outputs the position of the sound event related to the description, as illustrated in Fig. 1. The proposed task requires the model to explore the relationship between the textual information and the spatial information, and to locate the target source based on multi-modal correspondence. Some contrastive pretrained models have been proposed to explore the semantic relationships between different modalities. Contrastive Language–Image Pretraining (CLIP) [11] learns the visual-textual semantic information in a self-supervised way. WAV2CLIP [12] leverages the pretrained CLIP and distills knowledge from the visual-textual domain into the audio-textual domain. In [13], a contrastive language-audio pretraining model (CLAP) is proposed to model the audio-textual representation. Here, we explore how the pretrained multimodal representation models help our proposed task. To our knowledge, this is the first attempt to explore open-set text-queried target sound source localization.
![Refer to caption](x1.png)
The remainder of the paper is organised as follows. Section II recaps the relevant literature such as sound event localization and detection, and multimodal representation learning. Section III formulates the problem. Section IV introduces the proposed multimodal fusion model. Section V provides the training details and Section VI presents and analyzes the experiments. Section VII points out the limitations and concludes the paper.
II Related Work
II-A Sound Event Localization and Detection
Traditional sound source localization algorithms include multiple signal classification (MUSIC) and the generalized cross-correlation phase transform (GCC-PHAT). With the development of deep learning techniques, an increasing number of learning-based methods have been developed. An overview of the parametric-based methods and the learning-based methods can be found in [14]. In [15], SELDnet was proposed to predict the sound event class and the related positions recurrently. In [15], an activity-coupled Cartesian DOA (ACCDOA) representation is proposed to combine the training objectives of class prediction and location prediction. Multi-ACCDOA proposed in [16] extends ACCDOA to represent multiple sound events. In [17], a four-stage data augmentation method including audio channel swapping, multi-channel simulation, time-domain mixing and time-frequency masking is proposed to mitigate the data scarcity problem. In [18], the spatial cue-augmented log-spectrogram (SALSA) is proposed to integrate time-frequency features for sound event detection, and to integrate magnitude or phase difference for sound event localization. A lightweight version of SALSA is proposed in [19]. In [20], the event-independent network employs two branches to predict the sound events and locations, respectively, and uses the permutation invariant training method to find the most likely combination. A Transformer is used for SELD in [21], which predicts the means and covariances of the sound event locations via self-attention modules. In DCASE 2023 Track B, videos are added to serve as complementary information. Visual bounding boxes are encoded as stacked Gaussian vectors in the baseline systems. In the state-of-the-art system [22], speech separation is employed to assist the SELD task and a video pixel-swapping data augmentation technique is proposed. The object detection module is used to refine the localization results by aligning the DOA lines with bounding boxes.
In [23], zero-shot and few-shot SELD are tackled to predict the unseen sound event classes.
II-B Multimodal Representation Learning
Large pretrained models learn multi-modal representations and show strong one-shot or zero-shot capabilities on the downstream tasks. CLIP [11] learns the visual representations under textual supervision using contrastive learning and exhibits superior zero-shot performance on downstream tasks such as classification and detection. Following the training paradigm of CLIP, GLIP [24] deals with language-aware object detection in a contrastive learning way. In addition, CLIP has been extended to other tasks such as video retrieval [25] and action recognition [26]. AudioCLIP [27] extends CLIP to the audio domain and shows competitive performance on the environmental sound classification. In addition to the CLIP-based model, there are other models for multimodal representation learning. In [28], ViLT learns visual-textual representation with lightweight modality embedding and emphasizes modality interactions. In [29], BEiT-3 jointly trains image, texts and image-text pairs with masked modeling and employs different experts for different tasks.
III Text Queried SED
In this section, we define the proposed novel task. Given a text prompt describing the sound event, such as ‘A vehicle is accelerating’, and a mixed audio signal $x \in \mathbb{R}^{C \times L}$, where $C$ is the number of channels and $L$ is the audio length, the objective is to predict the DOA of the target sound event. For static targets, we predict a single DOA value. For moving targets, we predict the trajectory of the sound event $\{\theta_t\}_{t=1}^{T}$.
IV Model
![Refer to caption](x2.png)
The overall workflow of the proposed model is shown in Fig. 2, which contains three main parts, audio feature encoding, textual feature encoding and multimodal fusion.
IV-A Audio Feature Encoding
We employ GCC-PHAT as the frame-level audio feature. It contains spatial information, since the time lag between a pair of microphones can be inferred from the response map. GCC-PHAT calculates the generalized cross-correlation of the paired audio and normalizes it by the magnitude, retaining only the phase information:
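$$ G_{pq}(t, \tau) = \sum_{f} \frac{X_p(t, f)\, X_q^{*}(t, f)}{\left| X_p(t, f)\, X_q^{*}(t, f) \right|}\, e^{j 2 \pi f \tau} \tag{1} $$
where $X_p(t, f)$ denotes the short-term Fourier transform of the audio from microphone $p$, $\tau$ is the time lag, $(\cdot)^{*}$ denotes the complex conjugate and $f$ is the frequency. Here, $G_{pq} \in \mathbb{R}^{T \times D}$, where $T$ is the number of time frames and $D$ is the number of time lag intervals. A multi-channel feature $G$ with dimension $M \times T \times D$ can be formed by stacking $G_{pq}$ from different microphone pairs, where $M$ is the number of microphone pairs.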
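As a rough illustration of the feature described above, the following NumPy sketch computes a frame-level GCC-PHAT map for one microphone pair. The function name `gcc_phat`, the Hann window, the small constant `eps` and the ordering of the lags are assumptions made for illustration and are not taken from the authors' implementation. The default frame size, hop size and number of lags follow the values reported in Section V.

```python
import numpy as np

def gcc_phat(x_p, x_q, n_fft=1024, hop=640, n_lags=96, eps=1e-8):
    """Frame-level GCC-PHAT of two channels; returns an array of shape (T, n_lags)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x_p) - n_fft) // hop
    half = n_lags // 2
    out = np.zeros((n_frames, n_lags))
    for t in range(n_frames):
        start = t * hop
        X_p = np.fft.rfft(x_p[start:start + n_fft] * window)
        X_q = np.fft.rfft(x_q[start:start + n_fft] * window)
        cross = X_p * np.conj(X_q)
        cross /= np.abs(cross) + eps          # PHAT weighting: keep the phase only
        cc = np.fft.irfft(cross, n=n_fft)     # correlation over circular lags
        # Reorder so that the columns run over lags -half, ..., half - 1.
        out[t] = np.concatenate((cc[-half:], cc[:half]))
    return out

# Toy check: the second channel is the first one delayed by 5 samples.
rng = np.random.default_rng(0)
x = rng.standard_normal(16000)
g = gcc_phat(x, np.roll(x, 5))
peak_lag = int(np.argmax(g.mean(axis=0))) - 48   # lag of the average peak, in samples
```

With this sign convention a delay on the second channel shows up as a negative lag. Stacking the maps of all microphone pairs along a new leading axis gives a feature of shape (pairs, frames, lags).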
To obtain the clip-level spatial feature $f_s$, we use ResNet18 [30] to encode the frame-level GCC-PHAT feature $G$ and we extract features before the classification layer as follows:
$$ f_s = \mathrm{ResNet18}(G) \tag{2} $$
In addition to the spatial feature, we use the CLAP [13] audio encoder to obtain the clip-level semantic feature $f_a$ from the input audio $x$ as follows:
$$ f_a = \mathrm{CLAP}_{\mathrm{audio}}(x) \tag{3} $$
IV-B Text Feature Encoding
For the clip-level textual feature, similar to the semantic audio feature, we employ the CLAP text encoder to extract the sound event-related information $f_t$ from the text prompt $q$:
$$ f_t = \mathrm{CLAP}_{\mathrm{text}}(q) \tag{4} $$
IV-C Multimodal Fusion
For static sound events, we make clip-level predictions and use simple concatenation to fuse the audio features ($f_s$, $f_a$) and the text feature $f_t$, which has proved effective in target speaker extraction tasks [9], as follows:
$$ f = \mathrm{Concat}(f_s, f_a, f_t) \tag{5} $$
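The clip-level fusion can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `concat_fusion`, the feature sizes (512) and the random weights are hypothetical, and the single linear layer stands in for the fully connected classifier mentioned in the experiments.

```python
import numpy as np

def concat_fusion(f_s, f_t, weight, bias, f_a=None):
    """Concatenate clip-level features and apply one linear layer.

    f_s: spatial feature, f_t: text feature, f_a: optional semantic audio feature.
    Returns unnormalized scores, one per azimuth class.
    """
    parts = [f_s, f_t] if f_a is None else [f_s, f_a, f_t]
    fused = np.concatenate(parts, axis=-1)
    return fused @ weight + bias

rng = np.random.default_rng(0)
f_s = rng.standard_normal(512)                    # hypothetical spatial feature size
f_t = rng.standard_normal(512)                    # hypothetical text embedding size
weight = rng.standard_normal((1024, 360)) * 0.01  # 360 azimuth classes
bias = np.zeros(360)
scores = concat_fusion(f_s, f_t, weight, bias)
print(scores.shape)  # (360,)
```

Passing `f_a` adds the semantic audio embedding to the concatenation, in which case the weight matrix needs a correspondingly larger input size.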
For moving sound events, we make frame-level predictions. We employ the Transformer [31] for fusion and use the frame-level GCC-PHAT feature $G$. To ensure the time consistency between the input feature and the output, the audio feature is used as the query ($Q$) and the text feature is used as the key and value ($K$, $V$) in the cross attention module, i.e.
$$ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left( \frac{Q K^{\top}}{\sqrt{d_k}} \right) V \tag{6} $$
where $d_k$ is the dimension of the keys.
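To make the role of the query concrete, the following NumPy sketch implements single-head scaled dot-product cross attention with the audio frames as the query and the text tokens as key and value. The projection matrices, the feature sizes and the `text_mask` argument are hypothetical and chosen only for illustration. The sketch covers one attention step, not the full multi-layer Transformer.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(audio, text, W_q, W_k, W_v, text_mask=None):
    """audio: (T, d_audio) frames, text: (L, d_text) tokens; returns (T, d)."""
    Q = audio @ W_q                              # (T, d), one query per audio frame
    K = text @ W_k                               # (L, d)
    V = text @ W_v                               # (L, d)
    scores = Q @ K.T / np.sqrt(Q.shape[-1])      # (T, L)
    if text_mask is not None:
        # Padded text positions (mask == False) receive no attention weight.
        scores = np.where(text_mask[None, :], scores, -1e9)
    return softmax(scores, axis=-1) @ V          # (T, d)

rng = np.random.default_rng(0)
T, L, d_audio, d_text, d = 10, 15, 96, 32, 16
audio = rng.standard_normal((T, d_audio))
text = rng.standard_normal((L, d_text))
W_q = rng.standard_normal((d_audio, d))
W_k = rng.standard_normal((d_text, d))
W_v = rng.standard_normal((d_text, d))
mask = np.arange(L) < 7                          # only the first 7 tokens are real
fused = cross_attention(audio, text, W_q, W_k, W_v, mask)
```

Because the output has one row per query, using the audio as the query keeps the number of output steps equal to the number of audio frames, whatever the text length.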
IV-D Training objectives
The DOA estimate can be represented as a 360-dimensional one-hot vector and the model can be trained using the cross entropy loss. However, cross entropy is generally used for classification tasks, treating each degree as an independent class. Here, we employ the earth mover’s distance (EMD) loss [32], which can model the relationship between different degrees. First, we encode the target DOA as a Gaussian distribution $y$ that serves as the ground truth:
$$ y_i = \frac{1}{\sqrt{2 \pi}\, \sigma} \exp\left( -\frac{(i - \mu)^2}{2 \sigma^2} \right) \tag{7} $$
where $i \in \{1, \dots, 360\}$, $\mu$ is the mean and $\sigma$ is a predefined covariance.
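Let $\hat{y}$ denote the output of the text queried SED system after normalization by softmax. The EMD loss is calculated as the accumulated difference between the encoded ground truth $y$ and $\hat{y}$, which measures the similarity between two discrete distributions, as follows:
$$ \mathcal{L}_{\mathrm{EMD}} = \sum_{i=1}^{360} \left| \sum_{k=1}^{i} \left( y_k - \hat{y}_k \right) \right| \tag{8} $$
where $y_k$ and $\hat{y}_k$ are the $k$-th elements of $y$ and $\hat{y}$, respectively.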
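The training objective can be sketched as follows, assuming one class per degree of azimuth. The function names are hypothetical. The sketch makes three assumptions: the spread parameter is treated as a standard deviation, the Gaussian target is renormalized to sum to one, and the EMD uses the absolute difference of the two cumulative distributions. The exact variant used by the authors (for example a squared difference) may differ.

```python
import numpy as np

N_BINS = 360  # one class per degree of azimuth, 1..360

def encode_doa(mu, sigma=5.0):
    """Gaussian-shaped target over the azimuth bins, normalized to sum to 1."""
    bins = np.arange(1, N_BINS + 1)
    g = np.exp(-0.5 * ((bins - mu) / sigma) ** 2)
    return g / g.sum()

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def emd(p, q):
    """1-D earth mover's distance: summed absolute difference of the two CDFs."""
    return np.abs(np.cumsum(p) - np.cumsum(q)).sum()

def emd_loss(logits, target):
    """EMD between the softmax-normalized model output and the encoded target."""
    return emd(softmax(logits), target)

target = encode_doa(90)
near = emd_loss(np.log(encode_doa(92) + 1e-12), target)   # prediction 2 degrees off
far = emd_loss(np.log(encode_doa(150) + 1e-12), target)   # prediction 60 degrees off
```

Unlike cross entropy, the loss grows with the angular distance between the predicted and target distributions, so `near` is much smaller than `far`. Note that this plain cumulative form does not account for the wrap-around between 360 and 1 degrees.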
V Implementation Details
For calculating the STFT, the window length is set to 1024 with a hop size of 640. The number of time lags in GCC-PHAT is set to 96. For the extraction of audio and text semantic features, we use the CLAP encoders of [13] to obtain $f_a$ and $f_t$. For multimodal fusion, we use 4 encoders and 4 decoders in the Transformer with 256 hidden dimensions and 1024 feedforward dimensions. For model training, the batch size is set to 64 with an initial learning rate of 5e-4. The learning rate decays by a factor of 0.5 after 20 epochs. The training process is monitored by an early stopping mechanism with a patience of 10. The covariance in the EMD loss function is set to 5.
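We use the mean absolute error (MAE) to measure the model performance, which is calculated as follows:
$$ \mathrm{MAE} = \frac{1}{T} \sum_{t=1}^{T} \left| \hat{\theta}_t - \theta_t \right| \tag{9} $$
where $T$ is the number of time steps, and $\hat{\theta}_t$ and $\theta_t$ are the predicted and ground-truth azimuths at time step $t$. The predicted azimuth is derived by taking the $\arg\max$ of the predicted distribution. As the azimuth wraps around between 360° and 1°, an error $e$ larger than 180° is replaced by $360° - e$.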
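The wrap-around handling of the azimuth error can be written as a short Python sketch. The helper names `angular_error` and `mean_absolute_error` are hypothetical. The sketch assumes that an error above 180 degrees is replaced by 360 minus that error, which is the usual way to measure the shorter arc between two angles.

```python
def angular_error(pred_deg, true_deg):
    """Absolute azimuth error in degrees, measured along the shorter arc."""
    err = abs(pred_deg - true_deg) % 360
    return 360 - err if err > 180 else err

def mean_absolute_error(pred, true):
    """Average wrapped azimuth error over all time steps."""
    return sum(angular_error(p, t) for p, t in zip(pred, true)) / len(pred)

print(angular_error(350, 10))                       # 20
print(mean_absolute_error([350, 100], [10, 90]))    # 15.0
```

Without the wrap-around, a prediction of 350 degrees for a source at 10 degrees would be counted as an error of 340 degrees instead of 20.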
![Refer to caption](x3.png)
VI Experiment Analysis
We evaluate the proposed methods in two ways. We perform evaluations for the baseline methods on the simulated dataset created by simulated RIR, where the sound sources are static. The dataset is simulated on the fly from AudioCaps [33]. In addition, the baseline methods are evaluated on the dataset created by real RIR, where the sound sources are either static or moving. Each audio clip is truncated or padded to 10 seconds. We follow the original dataset split of AudioCaps to generate the training, evaluation and test set. For both cases, we randomly choose two audio clips with their corresponding captions and randomly choose one caption for the query. The simulation process in both cases is shown in Fig. 3.
VI-A Evaluation on Simulated RIR
The simulated dataset is created with Pyroomacoustics [34] and the RIR is simulated by the image source model. The length, width and height of the room each range from 5 m to 15 m. RT60 is randomly chosen in the range from 0.5 s to 1 s. The microphone array is a square of four microphones placed at a height of 1 m to 1.2 m. The distance between microphones ranges from 0.1 m to 0.13 m.
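The sampling of one simulated scene can be sketched with the Python standard library as follows. The ranges are those stated above. Everything else is an assumption made for illustration: the uniform sampling, the placement of the array at the centre of the floor plan, the interpretation of the inter-microphone distance as the side of the square, and the azimuth convention (counter-clockwise from the positive x-axis, in degrees).

```python
import math
import random

def sample_room_config(rng=random):
    """Draw one random room and square four-microphone array."""
    room_dim = [rng.uniform(5.0, 15.0) for _ in range(3)]   # length, width, height (m)
    rt60 = rng.uniform(0.5, 1.0)                            # reverberation time (s)
    side = rng.uniform(0.10, 0.13)                          # side of the square (m)
    height = rng.uniform(1.0, 1.2)                          # array height (m)
    cx, cy = room_dim[0] / 2, room_dim[1] / 2               # assumed: array at floor centre
    h = side / 2
    corners = ((-1, -1), (1, -1), (1, 1), (-1, 1))
    mics = [(cx + sx * h, cy + sy * h, height) for sx, sy in corners]
    return {"room_dim": room_dim, "rt60": rt60, "mic_positions": mics}

def azimuth_deg(source_xy, array_xy):
    """Azimuth of a source seen from the array centre, in [0, 360) degrees."""
    dx = source_xy[0] - array_xy[0]
    dy = source_xy[1] - array_xy[1]
    return math.degrees(math.atan2(dy, dx)) % 360.0

config = sample_room_config(random.Random(0))
```

The resulting room size, RT60 and microphone positions could then be handed to a room simulator such as Pyroomacoustics, and `azimuth_deg` gives the ground-truth DOA for a chosen source position.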
VI-B Evaluation on Real RIR
To evaluate the model performance in real scenarios, we also simulate a dataset using real RIR, which is created by the DCASE data generator [35] (https://github.com/danielkrause/DCASE2022-data-generator). It is widely used in many SELD systems [22] for pretraining. The RIRs are collected in 10 more rooms at Tampere University and different rooms have different reverberation conditions. The RIRs recorded in rooms 1, 2, 3, 4, 5, 6 and 10 are used for generating the training data. The RIRs recorded in rooms 8 and 9 are used for evaluation and test data generation, respectively. The data generator produces multi-channel audio in the FOA and MIC formats and we only adopt the FOA format, which is 4-channel spatial audio converted from the 32-channel Eigenmike format [35]. We generate 7224, 903 and 903 audio clips for training, evaluation and testing, respectively.
TABLE I: MAE (°) on the dataset created with simulated RIR.

| Text Feature | Audio Feature | Fusion | Dataset | MAE (°) |
|---|---|---|---|---|
| CLAP | GCC-PHAT | Concat | One dir. | 1.34 |
| CLAP | GCC-PHAT | Concat | Two dir. | 32.15 |
| CLAP | GCC-PHAT + CLAP | Concat | Two dir. | 43.3 |
| FlanT5 | GCC-PHAT | CA | Two dir. | 37.18 |
| BERT | GCC-PHAT | CA | Two dir. | 33.61 |

TABLE II: MAE (°) on the dataset created with real RIR.

| Text Feature | Audio Feature | Fusion | MAE (°) |
|---|---|---|---|
| CLAP | GCC-PHAT | Concat | 45.58 |
| CLAP | GCC-PHAT + CLAP | Concat | 47.92 |
| FlanT5 | GCC-PHAT | CA | 48.71 |
| BERT | GCC-PHAT | CA | 49.18 |

VI-C Experimental Results
The MAE results on the dataset created with simulated RIR are shown in TABLE I. As there is no previous work on text-queried target sound source localization, we compare different combinations of audio feature, text feature and fusion methods. We compare clip-level fusion and frame-level fusion. Firstly, we evaluate the clip-level fusion model using CLAP + GCC-PHAT with concatenation on a simple scenario where there is only one directional sound source, and the other sound source is added as background noise. A fully connected layer is used as a classifier to obtain the prediction $\hat{y}$, with the concatenated features as the input. It can be seen that the model can accurately locate the sound source, with an MAE of 1.34°. When there are two directional sources, it is more difficult to localize the target sound source and the MAE is larger. CLAP + GCC-PHAT with concatenation performs best (32.15°), and adding the CLAP audio embedding does not improve the localization performance (43.3°). In addition, we use frame-level cross attention fusion with FlanT5 [36] and BERT [37] text embeddings as the key. The maximum text length is set to 15 and the text embeddings are used with their corresponding masks. A [CLS] token is added before the GCC-PHAT feature along the time dimension. The output token corresponding to the [CLS] token is used for prediction. However, the performance is not as good as that of the clip-level fusion method.
Evaluation results on the dataset created by real RIR are shown in TABLE II. Different from the dataset created by simulated RIR, the sound sources are either moving or static and the target location is predicted at the frame level. We also explore fusion methods such as concatenation and cross attention. For concatenation, an LSTM is employed to take the concatenated input for frame-level prediction. The dataset created by real RIR with moving trajectories is clearly more challenging, and the MAE is larger than 40° for all methods. We visualize the localization results in Fig. 4. The data sequence is generated by fold 8 through the DCASE data generator. For the given sequence, it can be seen that the performance of the clip-level fusion method is better than that of the frame-level method. From frame 28 to frame 58, the sound source moves from 340° to approximately 40°. The clip-level fusion based methods are able to capture the movement while the frame-level fusion based methods fail.
![Refer to caption](extracted/5685959/figure/azimuth.png)
VII Conclusion
In this paper, we have presented a text-queried target sound source localization method and performed a benchmark study to explore the performance of the multimodal fusion baselines. Experimental results show that the concatenation of the CLAP text embedding and GCC-PHAT achieves the best localization performance. As there are no existing datasets for the proposed novel task, we only evaluate the model performance on simulated datasets. For future work, we will evaluate the model performance in real scenarios and improve the localization accuracy.
VIII Acknowledgement
This research was sponsored in part by Tencent AI Lab Rhino-Bird Gift Fund and in part by University of Surrey. The work of Xinyuan Qian was sponsored in part by CCF-Tencent Rhino-Bird Open Research Fund and National Natural Science Foundation of China, Grant No. 62306029.
References
- [1] J. F. Gemmeke et al., “Audio set: An ontology and human-labeled dataset for audio events,” in IEEE International Conference on Acoustics, Speech and Signal Processing, 2017, pp. 776–780.
- [2] X. Wang et al., “Towards more flexible and accurate object tracking with natural language: Algorithms and benchmark,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 13 763–13 773.
- [3] B. Kawar, S. Zada, O. Lang, O. Tov, H. Chang, T. Dekel, I. Mosseri, and M. Irani, “Imagic: Text-based real image editing with diffusion models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 6007–6017.
- [4] A.-M. Oncescu, A. Koepke, J. F. Henriques, Z. Akata, and S. Albanie, “Audio retrieval with natural language queries,” arXiv preprint arXiv:2105.02192, 2021.
- [5] H. Liu et al., “Audioldm: Text-to-audio generation with latent diffusion models,” arXiv preprint arXiv:2301.12503, 2023.
- [6] X. Liu et al., “Separate what you describe: Language-queried audio source separation,” arXiv preprint arXiv:2203.15147, 2022.
- [7] ——, “Separate anything you describe,” arXiv preprint arXiv:2308.05037, 2023.
- [8] R. Tan, A. Ray, A. Burns, B. A. Plummer, J. Salamon, O. Nieto, B. Russell, and K. Saenko, “Language-guided audio-visual source separation via trimodal consistency,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 10 575–10 584.
- [9] X. Hao et al., “Typing to listen at the cocktail party: Text-guided target speaker extraction,” arXiv preprint arXiv:2310.07284, 2023.
- [10] Y. Jiang et al., “Prompt-driven target speech diarization,” arXiv preprint arXiv:2310.14823, 2023.
- [11] A. Radford et al., “Learning transferable visual models from natural language supervision,” in International Conference on Machine Learning. PMLR, 2021, pp. 8748–8763.
- [12] H.-H. Wu et al., “Wav2clip: Learning robust audio representations from clip,” in IEEE International Conference on Acoustics, Speech and Signal Processing, 2021, pp. 4563–4567.
- [13] Y. Wu et al., “Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation,” in IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 1–5.
- [14] J. Zhao, Y. Xu, X. Qian, D. Berghi, P. Wu, M. Cui, J. Sun, P. J. Jackson, and W. Wang, “Audio-visual speaker tracking: Progress, challenges, and future directions,” arXiv preprint arXiv:2310.14778, 2023.
- [15] S. Adavanne et al., “Sound event localization and detection of overlapping sources using convolutional recurrent neural networks,” IEEE Journal of Selected Topics in Signal Processing, vol. 13, no. 1, pp. 34–48, 2018.
- [16] K. Shimada et al., “Multi-accdoa: Localizing and detecting overlapping sounds from the same class with auxiliary duplicating permutation invariant training,” in IEEE International Conference on Acoustics, Speech and Signal Processing, 2022, pp. 316–320.
- [17] Q. Wang et al., “A four-stage data augmentation approach to resnet-conformer based acoustic modeling for sound event localization and detection,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 1251–1264, 2023.
- [18] T. N. T. Nguyen et al., “Salsa: Spatial cue-augmented log-spectrogram features for polyphonic sound event localization and detection,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 30, pp. 1749–1762, 2022.
- [19] ——, “Salsa-lite: A fast and effective feature for polyphonic sound event localization and detection with microphone arrays,” in IEEE International Conference on Acoustics, Speech and Signal Processing, 2022, pp. 716–720.
- [20] Y. Cao et al., “Event-independent network for polyphonic sound event localization and detection,” arXiv preprint arXiv:2010.00140, 2020.
- [21] C. Schymura, B. Bönninghoff, T. Ochiai, M. Delcroix, K. Kinoshita, T. Nakatani, S. Araki, and D. Kolossa, “Pilot: Introducing transformers for probabilistic sound event localization,” arXiv preprint arXiv:2106.03903, 2021.
- [22] Q. Wang et al., “The nerc-slip system for sound event localization and detection of dcase2022 challenge,” DCASE2022 Challenge, Tech. Rep., 2022.
- [23] K. Shimada et al., “Zero-and few-shot sound event localization and detection,” arXiv preprint arXiv:2309.09223, 2023.
- [24] L. H. Li et al., “Grounded language-image pre-training,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 10 965–10 975.
- [25] H. Luo et al., “Clip4clip: An empirical study of clip for end to end video clip retrieval and captioning,” Neurocomputing, vol. 508, pp. 293–304, 2022.
- [26] M. Wang et al., “Actionclip: A new paradigm for video action recognition,” arXiv preprint arXiv:2109.08472, 2021.
- [27] A. Guzhov et al., “Audioclip: Extending clip to image, text and audio,” in IEEE International Conference on Acoustics, Speech and Signal Processing, 2022, pp. 976–980.
- [28] W. Kim et al., “Vilt: Vision-and-language transformer without convolution or region supervision,” in International Conference on Machine Learning. PMLR, 2021, pp. 5583–5594.
- [29] W. Wang et al., “Image as a foreign language: Beit pretraining for all vision and vision-language tasks,” arXiv preprint arXiv:2208.10442, 2022.
- [30] K. He et al., “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
- [31] A. Vaswani et al., “Attention is all you need,” Advances in Neural Information Processing Systems, vol. 30, 2017.
- [32] M. Yu, C. Zhang, Y. Xu, S. Zhang, and D. Yu, “Metricnet: Towards improved modeling for non-intrusive speech quality assessment,” arXiv preprint arXiv:2104.01227, 2021.
- [33] C. D. Kim, B. Kim, H. Lee, and G. Kim, “Audiocaps: Generating captions for audios in the wild,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019, pp. 119–132.
- [34] R. Scheibler, E. Bezzam, and I. Dokmanić, “Pyroomacoustics: A python package for audio room simulation and array processing algorithms,” in International Conference on Acoustics, Speech and Signal Processing. IEEE, 2018, pp. 351–355.
- [35] A. Politis, S. Adavanne, D. Krause, A. Deleforge, P. Srivastava, and T. Virtanen, “A dataset of dynamic reverberant sound scenes with directional interferers for sound event localization and detection,” arXiv preprint arXiv:2106.06999, 2021.
- [36] H. W. Chung et al., “Scaling instruction-finetuned language models,” arXiv preprint arXiv:2210.11416, 2022.
- [37] J. Devlin et al., “Bert: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.