Text-Queried Target Sound Event Localization

Jinzheng Zhao, Xinyuan Qian, Yong Xu, Haohe Liu, Yin Cao§, Davide Berghi, Wenwu Wang
Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey, UK; Department of Computer Science and Technology, University of Science and Technology Beijing, China; Tencent AI Lab, Bellevue, WA, USA; §Department of Intelligent Science, Xi’an Jiaotong Liverpool University, China
Abstract

Sound event localization and detection (SELD) aims to determine which sound classes are active, together with their direction of arrival (DOA). However, current SELD systems can only predict the activities of a fixed set of classes, for example, the 13 classes in the DCASE challenges. In this paper, we propose text-queried target sound event localization (SEL), a new paradigm in which the user inputs text describing a sound event and the SEL model predicts the location of the related sound event. The proposed task offers a more user-friendly form of human-computer interaction. We provide a benchmark study for the proposed task and perform experiments on datasets created with simulated room impulse responses (RIRs) and real RIRs to validate the effectiveness of the proposed methods. We hope that our benchmark will inspire interest and further research in text-queried sound source localization.

Index Terms:
sound event localization and detection, multimodal fusion

I Introduction

Sound event localization and detection (SELD) has attracted increasing interest recently owing to the launch of Detection and Classification of Acoustic Scenes and Events (DCASE) Task 3, whose objective is to jointly predict the temporal and spatial activities of sound events. In previous editions, only the audio modality (e.g. multi-channel audio in FOA and MIC formats) was used. In DCASE 2023, 360° videos were added as a complementary modality. However, current task settings are constrained to a limited set of classes. The number of classes in DCASE 2023 and L3DAS 2023 is 13, covering people speaking, laughter, music, etc., which follow the AudioSet [1] ontology.

Text-prompt based tasks have become prevalent recently as the model works according to the text description and is not constrained to limited classes. In the visual domain, text is used to describe the visual objects for tracking in the image plane in [2]. Imagic is proposed in [3] for text-conditioned image editing. In the audio domain, text queried audio tagging is introduced in [4]. In [5], general sound synthesis is based on the text input. In [6, 7], general source separation is conducted based on the text description and is not constrained to specific classes. Visual modality is further added in [8] to assist text-prompted source separation. In [9], text-queried separation is extended to the speech domain. Users can type to describe the speaker they want to separate, such as ‘the loudest speaker’, ‘the female speaker’, etc. In [10], text-queried target speech diarization is explored and text is used to choose the target speaker such as ‘the person who spoke at two seconds’, ‘the male speaker’ or ‘the keynote speaker’.

Inspired by these text-driven tasks, we propose text-queried target sound event localization, which takes a text description as input and outputs the position of the sound event related to the description, as illustrated in Fig. 1. The proposed task requires the model to explore the relationship between the textual information and the spatial information, and to locate the target source based on multi-modal correspondence. Several contrastive pretrained models have been proposed to explore the semantic relationships between different modalities. Contrastive Language–Image Pretraining (CLIP) [11] learns visual-textual semantic information in a self-supervised way. WAV2CLIP [12] leverages the pretrained CLIP and distills knowledge from the visual-textual domain into the audio-textual domain. In [13], a contrastive language-audio pretraining model (CLAP) is proposed to model the audio-textual representation. Here, we explore how pretrained multimodal representation models can help our proposed task. To our knowledge, this is the first attempt to explore open-set text-queried target sound source localization.

Figure 1: An illustration of the proposed text-queried target sound event localization system. The SEL system takes the input of spatial audio and the user’s text description, and predicts the azimuth of the related sound source.

The remainder of the paper is organised as follows. Section II recaps the relevant literature such as sound event localization and detection, and multimodal representation learning. Section III formulates the problem. Section IV introduces the proposed multimodal fusion model. Section V provides the training details and Section VI presents and analyzes the experiments. Section VII points out the limitations and concludes the paper.

II Related Work

II-A Sound Event Localization and Detection

Traditional sound source localization algorithms include multiple signal classification (MUSIC) and the generalized cross-correlation phase transform (GCC-PHAT). With the development of deep learning techniques, an increasing number of learning-based methods have been developed. An overview of the parametric-based methods and the learning-based methods can be found in [14]. In [15], SELDnet was proposed to predict the sound event class and the related positions recurrently. In [15], an activity-coupled Cartesian DOA (ACCDOA) representation is proposed to combine the training objectives of class prediction and location prediction. Multi-ACCDOA proposed in [16] extends ACCDOA to represent multiple sound events. In [17], a four-stage data augmentation method including audio channel swapping, multi-channel simulation, time-domain mixing and time-frequency masking is proposed to mitigate the data scarcity problem. In [18], the spatial cue-augmented log-spectrogram (SALSA) is proposed to integrate time-frequency features for sound event detection with magnitude or phase differences for sound event localization. A lightweight version of SALSA is proposed in [19]. In [20], the event-independent network employs two branches to predict the sound events and locations, respectively, and uses permutation invariant training to find the most likely combination. A Transformer is used for SELD in [21], which predicts the means and covariances of the sound event locations via self-attention modules. In DCASE 2023 Track B, 360° videos are added to serve as complementary information. Visual bounding boxes are encoded as stacked Gaussian vectors in the baseline systems. In the state-of-the-art system [22], speech separation is employed to assist the SELD task and a video pixel-swapping data augmentation technique is proposed.
The object detection module is used to refine the localization results by aligning the DOA lines with bounding boxes. In [23], zero-shot and few-shot SELD are tackled to predict unseen sound event classes.

II-B Multimodal Representation Learning

Large pretrained models learn multi-modal representations and show strong one-shot or zero-shot capabilities on the downstream tasks. CLIP [11] learns the visual representations under textual supervision using contrastive learning and exhibits superior zero-shot performance on downstream tasks such as classification and detection. Following the training paradigm of CLIP, GLIP [24] deals with language-aware object detection in a contrastive learning way. In addition, CLIP has been extended to other tasks such as video retrieval [25] and action recognition [26]. AudioCLIP [27] extends CLIP to the audio domain and shows competitive performance on the environmental sound classification. In addition to the CLIP-based model, there are other models for multimodal representation learning. In [28], ViLT learns visual-textual representation with lightweight modality embedding and emphasizes modality interactions. In [29], BEiT-3 jointly trains image, texts and image-text pairs with masked modeling and employs different experts for different tasks.

III Text-Queried SEL

In this section, we define the proposed novel task. Given a text prompt describing the sound event, such as ‘A vehicle is accelerating’, and a mixed audio $\mathbf{a}\in\mathbb{R}^{C\times T}$, where $C$ is the number of channels and $T$ is the audio length, the objective is to predict the DOA $d$ of the target sound event. For static targets, we predict a single DOA value. For moving targets, we predict the trajectory of the sound event $\mathbf{d}\in\mathbb{R}^{T}$.

IV Model

Figure 2: The proposed model consists of three parts, an audio encoder, a text encoder and the multimodal fusion module.

The overall workflow of the proposed model is shown in Fig. 2, which contains three main parts, audio feature encoding, textual feature encoding and multimodal fusion.

IV-A Audio Feature Encoding

We employ GCC-PHAT $\mathbf{g}(t,\tau)$ as the frame-level audio feature. It carries spatial information, since the time lag between a microphone pair can be inferred from the response map. It calculates the generalized cross-correlation of the paired audio channels $(m,n)$ and is normalized by the magnitude, retaining only the phase information.

$$\mathbf{g}(t,\tau)=\int_{-\infty}^{+\infty}\frac{\Phi_{m}(t,f)\Phi_{n}^{*}(t,f)}{\left|\Phi_{m}(t,f)\right|\left|\Phi_{n}^{*}(t,f)\right|}e^{j2\pi f\tau}df \tag{1}$$

where $\Phi$ denotes the short-time Fourier transform, $\tau$ is the time lag, $*$ denotes the complex conjugate and $f$ is the frequency. Here, $\mathbf{g}(t,\tau)\in\mathbb{R}^{T\times F}$, where $T$ is the number of time frames and $F$ is the number of time lag intervals. A multi-channel feature in $\mathbb{R}^{L\times T\times F}$ can be formed by stacking $\mathbf{g}$ from different microphone pairs, where $L$ is the number of microphone pairs.
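As an illustration of Eq. (1), the following is a minimal NumPy sketch of frame-level GCC-PHAT, not the authors' implementation. The function names `gcc_phat_frames` and `gcc_phat_stack` are hypothetical, the Hann window is an assumption, and stacking all unordered microphone pairs (so that $L=6$ for four channels) is also an assumption. The default FFT size, hop size and number of lags follow the values given in Section V.

```python
import itertools
import numpy as np


def gcc_phat_frames(x_m, x_n, n_fft=1024, hop=640, n_lags=96):
    """Frame-level GCC-PHAT of one channel pair, shape (num_frames, n_lags)."""
    window = np.hanning(n_fft)  # assumption: the window type is not specified
    num_frames = 1 + (len(x_m) - n_fft) // hop
    half = n_lags // 2
    out = np.empty((num_frames, n_lags))
    for t in range(num_frames):
        seg_m = x_m[t * hop : t * hop + n_fft] * window
        seg_n = x_n[t * hop : t * hop + n_fft] * window
        # Cross-spectrum Phi_m * conj(Phi_n) of the two channels.
        cross = np.fft.rfft(seg_m) * np.conj(np.fft.rfft(seg_n))
        # PHAT weighting: divide by the magnitude so only the phase remains.
        cross /= np.abs(cross) + 1e-12
        cc = np.fft.irfft(cross, n=n_fft)
        # Keep lags -half .. half-1, with lag 0 at column index `half`.
        out[t] = np.concatenate((cc[-half:], cc[:half]))
    return out


def gcc_phat_stack(audio, **kwargs):
    """Stack GCC-PHAT maps of all channel pairs: (L, num_frames, n_lags)."""
    pairs = itertools.combinations(range(audio.shape[0]), 2)
    return np.stack([gcc_phat_frames(audio[m], audio[n], **kwargs) for m, n in pairs])
```

With this sign convention, a signal that reaches channel $n$ later than channel $m$ produces a peak at a negative lag; the opposite convention is equally valid as long as it is used consistently.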

To obtain the clip-level spatial feature $\mathbf{f}_{a_{spa}}$, we use ResNet18 [30] to encode the frame-level feature $\mathbf{g}$ and extract the features before the classification layer, $\mathbf{f}_{a_{spa}}\in\mathbb{R}^{512}$, as follows,

$$\mathbf{f}_{a_{spa}}=\mathcal{F}_{ResNet}(\mathbf{g}) \tag{2}$$

In addition to the spatial feature, we use the CLAP [13] audio encoder $\mathcal{F}_{CLAPaudio}$ to obtain the clip-level semantic feature $\mathbf{f}_{a_{sem}}\in\mathbb{R}^{512}$ as follows

$$\mathbf{f}_{a_{sem}}=\mathcal{F}_{CLAPaudio}(\mathbf{a}) \tag{3}$$

IV-B Text Feature Encoding

For the clip-level textual feature, similar to the semantic audio feature, we employ the CLAP text encoder $\mathcal{F}_{CLAPtext}$ to extract sound event-related information $\mathbf{f}_{t_{sem}}\in\mathbb{R}^{512}$ from the text prompt, denoted here by $\mathbf{s}$:

$$\mathbf{f}_{t_{sem}}=\mathcal{F}_{CLAPtext}(\mathbf{s}) \tag{4}$$

IV-C Multimodal Fusion

For static sound events, we make clip-level predictions and use simple concatenations to fuse the audio and text information, which proves to be effective in target speaker extraction tasks [9], as follows

$$\mathbf{f}=[\mathbf{f}_{a_{spa}};\mathbf{f}_{a_{sem}};\mathbf{f}_{t_{sem}}] \tag{5}$$
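The clip-level fusion in Eq. (5) can be sketched as follows. This is only an illustration with random placeholder features: the helper `fuse_and_classify` and the randomly initialised weights are hypothetical, and the single fully connected layer mapping the fused vector to 360 azimuth logits follows the classifier described in Section VI-C.

```python
import numpy as np


def fuse_and_classify(f_spa, f_asem, f_tsem, weight, bias):
    """Concatenate the three 512-d features (Eq. 5) and map them to 360 logits."""
    fused = np.concatenate([f_spa, f_asem, f_tsem])  # shape (1536,)
    return weight @ fused + bias                     # shape (360,)


rng = np.random.default_rng(0)
# Placeholder features standing in for the ResNet18 and CLAP encoder outputs.
f_spa, f_asem, f_tsem = (rng.standard_normal(512) for _ in range(3))
# Hypothetical, untrained classifier parameters.
weight = rng.standard_normal((360, 3 * 512)) * 0.01
bias = np.zeros(360)

logits = fuse_and_classify(f_spa, f_asem, f_tsem, weight, bias)
```

Dropping `f_asem` from the concatenation (and shrinking `weight` accordingly) gives the variant that uses only GCC-PHAT and the CLAP text embedding.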

For moving sound events, we make frame-level predictions. We employ the Transformer [31] for fusion and use the frame-level audio feature $\mathbf{g}$. To ensure time consistency between the input feature and the output, the audio feature $\mathbf{g}$ is used as the query ($\mathbf{Q}$) and the text feature is used as the key ($\mathbf{K}$) and the value ($\mathbf{V}$) in the cross attention $CA(\mathbf{Q},\mathbf{K},\mathbf{V})$ module, i.e.

$$\mathbf{f}=CA(\mathbf{g},\mathbf{f}_{t_{sem}},\mathbf{f}_{t_{sem}}) \tag{6}$$
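To make the roles of query, key and value in Eq. (6) concrete, here is a minimal single-head scaled dot-product cross attention in NumPy. It is a sketch under stated assumptions, not the multi-layer Transformer used in the paper: the function `cross_attention`, the projection matrices and the toy shapes (24 frames of 96 lags, 15 text tokens of dimension 768, model dimension 256) are chosen only for illustration. Because the audio frames act as the query, the output keeps one vector per time frame.

```python
import numpy as np


def cross_attention(query_feats, text_feats, w_q, w_k, w_v, text_mask=None):
    """Single-head cross attention: audio frames attend to text features."""
    q = query_feats @ w_q  # (T, d) queries from the audio frames
    k = text_feats @ w_k   # (N, d) keys from the text features
    v = text_feats @ w_v   # (N, d) values from the text features
    scores = q @ k.T / np.sqrt(q.shape[-1])  # (T, N)
    if text_mask is not None:
        # Padded text positions get zero weight (assumes >= 1 valid token).
        scores = np.where(text_mask[None, :], scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights  # fused features (T, d), attention (T, N)


rng = np.random.default_rng(0)
frames = rng.standard_normal((24, 96))   # placeholder GCC-PHAT frames
tokens = rng.standard_normal((15, 768))  # placeholder text embeddings
mask = np.arange(15) < 6                 # first 6 tokens valid, rest padding
w_q = rng.standard_normal((96, 256)) * 0.05
w_k = rng.standard_normal((768, 256)) * 0.05
w_v = rng.standard_normal((768, 256)) * 0.05

fused, attn = cross_attention(frames, tokens, w_q, w_k, w_v, mask)
```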

IV-D Training objectives

The DOA estimate can be represented as a one-hot vector in $\mathbb{R}^{360}$ and the model can be trained using the cross entropy loss. However, cross entropy is generally used for classification tasks and treats each degree as an independent class. Here, we employ the earth mover’s distance (EMD) loss [32], which can model the relationship between different degrees. First, we encode the target DOA $d$ as a Gaussian distribution that serves as the ground truth:

$$y_{i}\sim\mathcal{N}(d,\sigma^{2}) \tag{7}$$

where $1\leq i\leq 360$, $d$ is the mean and $\sigma^{2}$ is a predefined variance.

Let $\mathbf{o}\in\mathbb{R}^{360}$ denote the output of the text-queried SEL system. After normalizing $\mathbf{o}$ by softmax, the EMD loss is calculated as the accumulated difference between the encoded ground truth and the normalized output, which measures the discrepancy between the two discrete distributions, as follows

$$loss_{EMD}=\sum_{i=1}^{360}\left|softmax(o_{i})-y_{i}\right| \tag{8}$$

where $o_{i}$ is the $i$-th element of $\mathbf{o}$.
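The target encoding of Eq. (7) and the loss of Eq. (8) can be sketched in NumPy as below. This follows Eq. (8) literally, as an accumulated element-wise absolute difference. Two details are assumptions not stated in the text: the Gaussian is wrapped around the circle so that 360° and 1° are neighbours, and the target is normalized to sum to one. The helper names are hypothetical, and the default $\sigma^{2}=5$ is the value given in Section V.

```python
import numpy as np


def gaussian_target(doa_deg, sigma2=5.0, n_bins=360):
    """Encode a DOA (in degrees, 1..360) as a discrete Gaussian over 360 bins."""
    bins = np.arange(1, n_bins + 1)
    diff = np.abs(bins - doa_deg)
    diff = np.minimum(diff, n_bins - diff)  # assumption: circular distance
    y = np.exp(-0.5 * diff**2 / sigma2)
    return y / y.sum()                      # assumption: normalized to sum to 1


def softmax(o):
    e = np.exp(o - o.max())
    return e / e.sum()


def emd_loss(logits, target):
    """Accumulated absolute difference between softmax(logits) and the target."""
    return np.abs(softmax(logits) - target).sum()


target = gaussian_target(90)
# Two toy predictions: one sharply peaked at 90 degrees, one at 270 degrees.
logits_near = np.full(360, -20.0)
logits_near[89] = 20.0
logits_far = np.full(360, -20.0)
logits_far[269] = 20.0
loss_near = emd_loss(logits_near, target)
loss_far = emd_loss(logits_far, target)
```

In this toy case the prediction peaked at the target bin yields a smaller loss than the one peaked 180° away, and the loss is bounded by 2 because both arguments are probability distributions.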

V Implementation Details

For the STFT, $n\_fft$ is set to 1024 with a hop size of 640. The number of time lags in GCC-PHAT is set to 96. For the extraction of audio and text semantic features, we use the CLAP encoders in [13] to obtain $\mathbf{f}_{a_{sem}}$ and $\mathbf{f}_{t_{sem}}\in\mathbb{R}^{512}$. For multimodal fusion, we use 4 encoders and 4 decoders in the Transformer with 256 hidden dimensions and 1024 feedforward dimensions. For model training, the batch size is set to 64 with an initial learning rate of 5e-4. The learning rate decays by a factor of 0.5 after 20 epochs. Training is monitored by early stopping with a patience of 10. The variance $\sigma^{2}$ in the EMD loss function is set to 5.

We use the mean absolute error (MAE) to measure the model performance, which is calculated as follows.

$$MAE=\frac{1}{T}\sum_{t=1}^{T}\left|\operatorname*{arg\,max}_{i}(\mathbf{o}_{i,t})-d_{t}\right| \tag{9}$$

where $T$ is the number of time steps. The predicted azimuth is obtained by taking the $\operatorname*{arg\,max}$ of the predicted distribution. As the azimuth is circular, with 360° adjacent to 1°, an error larger than 180° is replaced by 360° minus that error.
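The metric with its wrap-around rule can be sketched as follows. The function name `circular_mae` is hypothetical, and mapping bin index 0 to 1° is an assumption about how the 360 output bins are indexed.

```python
import numpy as np


def circular_mae(pred_dist, target_deg):
    """MAE in degrees between arg-max azimuths and targets, with wrap-around.

    pred_dist:  (T, 360) predicted distribution per time step.
    target_deg: (T,) ground-truth azimuths in degrees, in the range 1..360.
    """
    pred_deg = np.argmax(pred_dist, axis=-1) + 1  # assumption: bin 0 <-> 1 degree
    err = np.abs(pred_deg - target_deg)
    err = np.where(err > 180, 360 - err, err)     # wrap errors above 180 degrees
    return err.mean()


# Two time steps: predicted 359 vs. target 2 (error 3 after wrapping),
# and predicted 10 vs. target 20 (error 10).
pred = np.zeros((2, 360))
pred[0, 358] = 1.0
pred[1, 9] = 1.0
mae = circular_mae(pred, np.array([2, 20]))
```

Without the wrap-around step, the first time step would contribute an error of 357° instead of 3°.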

Figure 3: The simulation process using simulated RIR (the upper branch) and real RIR (the bottom branch).

VI Experiment Analysis

We evaluate the proposed methods in two ways. First, we evaluate the baseline methods on a dataset created with simulated RIRs, where the sound sources are static. This dataset is simulated on the fly from AudioCaps [33]. Second, the baseline methods are evaluated on a dataset created with real RIRs, where the sound sources are either static or moving. Each audio clip is truncated or padded to 10 seconds. We follow the original dataset split of AudioCaps to generate the training, evaluation and test sets. In both cases, we randomly choose two audio clips with their corresponding captions and randomly choose one of the captions as the query. The simulation process in both cases is shown in Fig. 3.

VI-A Evaluation on Simulated RIR

The simulated dataset is created with Pyroomacoustics [34] and the RIRs are simulated by the image source model. The length, width and height of the room each range from 5 m to 15 m. RT60 is randomly chosen between 0.5 s and 1 s. The microphone array is a square of four microphones placed at a height between 1 m and 1.2 m. The spacing between adjacent microphones ranges from 0.1 m to 0.13 m.
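The parameter ranges above can be turned into a scene sampler along the following lines. This NumPy sketch only draws the geometry and does not render any RIR (that step would be done with Pyroomacoustics and is not shown). The function `sample_scene`, the uniform sampling, the array placed at the room centre, and the source placed 1 m to 2 m from the array at the array height are all assumptions made for illustration; the source does not specify them.

```python
import numpy as np


def sample_scene(rng):
    """Draw one room, a square 4-microphone array and one source position."""
    room_dim = rng.uniform(5.0, 15.0, size=3)  # length, width, height in m
    rt60 = rng.uniform(0.5, 1.0)               # reverberation time in s
    mic_height = rng.uniform(1.0, 1.2)         # array height in m
    spacing = rng.uniform(0.10, 0.13)          # side of the square array in m

    center_xy = room_dim[:2] / 2.0             # assumption: array at room centre
    corners = 0.5 * spacing * np.array([[1, 1], [-1, 1], [-1, -1], [1, -1]])
    mics = np.column_stack([center_xy + corners, np.full(4, mic_height)])

    # Assumption: source on a circle around the array, at the array height.
    azimuth_deg = rng.uniform(0.0, 360.0)
    radius = rng.uniform(1.0, 2.0)
    az = np.deg2rad(azimuth_deg)
    source = np.array([
        center_xy[0] + radius * np.cos(az),
        center_xy[1] + radius * np.sin(az),
        mic_height,
    ])
    return {
        "room_dim": room_dim, "rt60": rt60, "spacing": spacing,
        "mics": mics, "source": source, "azimuth_deg": azimuth_deg,
    }


scene = sample_scene(np.random.default_rng(0))
```

The sampled `azimuth_deg` then serves as the ground-truth DOA for the directional source in that scene.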

VI-B Evaluation on Real RIR

To evaluate the model performance in more realistic scenarios, we also simulate a dataset using real RIRs with the DCASE data generator [35] (https://github.com/danielkrause/DCASE2022-data-generator), which is widely used for pretraining in SELD systems [22]. The RIRs are collected in 10 more rooms at Tampere University and different rooms have different reverberation conditions. The RIRs recorded in rooms 1, 2, 3, 4, 5, 6 and 10 are used for generating the training data. The RIRs recorded in rooms 8 and 9 are used for evaluation and test data generation, respectively. The data generator produces multi-channel audio in both FOA and MIC formats and we only adopt the FOA format, which is 4-channel spatial audio converted from the 32-channel Eigenmike format [35]. We generate 7224, 903 and 903 audio clips for training, evaluation and testing, respectively.

TABLE I: Experimental results on the dataset created by simulated RIR, where sound events are static. ‘dir.’ denotes the directional source and ‘add.’ denotes the additive source. ‘CA’ means cross attention.

| Text Feature | Audio Feature | Fusion | Datasets | MAE (°) |
| --- | --- | --- | --- | --- |
| CLAP | GCC-PHAT | Concat | One dir., One add. | 1.34 |
| CLAP | GCC-PHAT | Concat | Two dir. | 32.15 |
| CLAP | GCC-PHAT, CLAP | Concat | Two dir. | 43.3 |
| FlanT5 | GCC-PHAT | CA | Two dir. | 37.18 |
| BERT | GCC-PHAT | CA | Two dir. | 33.61 |

TABLE II: Experimental results on the dataset created by real RIR, where sound events are either static or moving.

| Text Feature | Audio Feature | Fusion | MAE (°) |
| --- | --- | --- | --- |
| CLAP | GCC-PHAT | Concat | 45.58 |
| CLAP | GCC-PHAT, CLAP | Concat | 47.92 |
| FlanT5 | GCC-PHAT | CA | 48.71 |
| BERT | GCC-PHAT | CA | 49.18 |

VI-C Experimental Results

The MAE results on the dataset created with simulated RIRs are shown in TABLE I. As there is no previous work on text-queried target sound source localization, we compare different combinations of audio feature, text feature and fusion method. We compare clip-level fusion and frame-level fusion. First, we evaluate the clip-level fusion model using CLAP + GCC-PHAT with concatenation in a simple scenario where there is only one directional sound source and the other sound source is added as background noise. A fully connected layer is used as a classifier to obtain $\mathbf{o}$, with the concatenated features as the input. The model accurately locates the sound source, with an MAE of 1.34°. When there are two directional sources, it is more difficult to localize the target sound source and the MAE is larger. CLAP + GCC-PHAT with concatenation performs best, and adding the CLAP audio embedding does not improve the localization performance. In addition, we use frame-level cross attention fusion with FlanT5 [36] and BERT [37] text embeddings as the key. The maximum text length is set to 15 and the embedding dimensions are $15\times 1024$ and $15\times 768$, respectively, with corresponding masks. A [CLS] token is added before the GCC-PHAT along the time dimension. The output token corresponding to the [CLS] token is used for prediction. However, the performance is not as good as that of the clip-level fusion method.

Evaluation results on the dataset created with real RIRs are shown in TABLE II. Unlike the dataset created with simulated RIRs, the sound sources are either moving or static and the target location is predicted at the frame level. We again explore fusion methods such as concatenation and cross attention. For concatenation, an LSTM takes the concatenated input for frame-level prediction. The dataset created with real RIRs and moving trajectories is clearly more challenging, and the MAE is larger than 40°. We visualize the localization results in Fig. 4. The data sequence is generated by fold 8 through the DCASE data generator. For the given sequence, it can be seen that the performance of the clip-level fusion method is better than that of the frame-level method. From frame 28 to frame 58, the sound source moves from 340° to approximately 40°. The clip-level fusion based methods are able to capture the movement while the frame-level fusion based methods fail.

Figure 4: Visualization results of the clip-level fusion methods (the upper figure) and the frame-level fusion methods (the bottom figure).

VII Conclusion

In this paper, we have presented a text-queried target sound source localization method and performed a benchmark study to explore the performance of multimodal fusion baselines. Experimental results show that the concatenation of the CLAP text embedding and GCC-PHAT achieves the best localization performance. As there are no existing datasets for the proposed novel task, we only evaluate the model performance on simulated datasets. For future work, we will evaluate the model performance in real scenarios and improve the localization accuracy.

VIII Acknowledgement

This research was sponsored in part by Tencent AI Lab Rhino-Bird Gift Fund and in part by University of Surrey. The work of Xinyuan Qian was sponsored in part by CCF-Tencent Rhino-Bird Open Research Fund and National Natural Science Foundation of China, Grant No. 62306029.

References

  • [1] J. F. Gemmeke et al., “Audio set: An ontology and human-labeled dataset for audio events,” in IEEE International Conference on Acoustics, Speech and Signal Processing, 2017, pp. 776–780.
  • [2] X. Wang et al., “Towards more flexible and accurate object tracking with natural language: Algorithms and benchmark,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 13 763–13 773.
  • [3] B. Kawar, S. Zada, O. Lang, O. Tov, H. Chang, T. Dekel, I. Mosseri, and M. Irani, “Imagic: Text-based real image editing with diffusion models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 6007–6017.
  • [4] A.-M. Oncescu, A. Koepke, J. F. Henriques, Z. Akata, and S. Albanie, “Audio retrieval with natural language queries,” arXiv preprint arXiv:2105.02192, 2021.
  • [5] H. Liu et al., “Audioldm: Text-to-audio generation with latent diffusion models,” arXiv preprint arXiv:2301.12503, 2023.
  • [6] X. Liu et al., “Separate what you describe: Language-queried audio source separation,” arXiv preprint arXiv:2203.15147, 2022.
  • [7] ——, “Separate anything you describe,” arXiv preprint arXiv:2308.05037, 2023.
  • [8] R. Tan, A. Ray, A. Burns, B. A. Plummer, J. Salamon, O. Nieto, B. Russell, and K. Saenko, “Language-guided audio-visual source separation via trimodal consistency,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 10 575–10 584.
  • [9] X. Hao et al., “Typing to listen at the cocktail party: Text-guided target speaker extraction,” arXiv preprint arXiv:2310.07284, 2023.
  • [10] Y. Jiang et al., “Prompt-driven target speech diarization,” arXiv preprint arXiv:2310.14823, 2023.
  • [11] A. Radford et al., “Learning transferable visual models from natural language supervision,” in International Conference on Machine Learning.   PMLR, 2021, pp. 8748–8763.
  • [12] H.-H. Wu et al., “Wav2clip: Learning robust audio representations from clip,” in IEEE International Conference on Acoustics, Speech and Signal Processing, 2021, pp. 4563–4567.
  • [13] Y. Wu et al., “Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation,” in IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 1–5.
  • [14] J. Zhao, Y. Xu, X. Qian, D. Berghi, P. Wu, M. Cui, J. Sun, P. J. Jackson, and W. Wang, “Audio-visual speaker tracking: Progress, challenges, and future directions,” arXiv preprint arXiv:2310.14778, 2023.
  • [15] S. Adavanne et al., “Sound event localization and detection of overlapping sources using convolutional recurrent neural networks,” IEEE Journal of Selected Topics in Signal Processing, vol. 13, no. 1, pp. 34–48, 2018.
  • [16] K. Shimada et al., “Multi-accdoa: Localizing and detecting overlapping sounds from the same class with auxiliary duplicating permutation invariant training,” in IEEE International Conference on Acoustics, Speech and Signal Processing, 2022, pp. 316–320.
  • [17] Q. Wang et al., “A four-stage data augmentation approach to resnet-conformer based acoustic modeling for sound event localization and detection,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 1251–1264, 2023.
  • [18] T. N. T. Nguyen et al., “Salsa: Spatial cue-augmented log-spectrogram features for polyphonic sound event localization and detection,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 30, pp. 1749–1762, 2022.
  • [19] ——, “Salsa-lite: A fast and effective feature for polyphonic sound event localization and detection with microphone arrays,” in IEEE International Conference on Acoustics, Speech and Signal Processing, 2022, pp. 716–720.
  • [20] Y. Cao et al., “Event-independent network for polyphonic sound event localization and detection,” arXiv preprint arXiv:2010.00140, 2020.
  • [21] C. Schymura, B. Bönninghoff, T. Ochiai, M. Delcroix, K. Kinoshita, T. Nakatani, S. Araki, and D. Kolossa, “Pilot: Introducing transformers for probabilistic sound event localization,” arXiv preprint arXiv:2106.03903, 2021.
  • [22] Q. Wang et al., “The nerc-slip system for sound event localization and detection of dcase2022 challenge,” DCASE2022 Challenge, Tech. Rep., 2022.
  • [23] K. Shimada et al., “Zero-and few-shot sound event localization and detection,” arXiv preprint arXiv:2309.09223, 2023.
  • [24] L. H. Li et al., “Grounded language-image pre-training,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 10 965–10 975.
  • [25] H. Luo et al., “Clip4clip: An empirical study of clip for end to end video clip retrieval and captioning,” Neurocomputing, vol. 508, pp. 293–304, 2022.
  • [26] M. Wang et al., “Actionclip: A new paradigm for video action recognition,” arXiv preprint arXiv:2109.08472, 2021.
  • [27] A. Guzhov et al., “Audioclip: Extending clip to image, text and audio,” in IEEE International Conference on Acoustics, Speech and Signal Processing, 2022, pp. 976–980.
  • [28] W. Kim et al., “Vilt: Vision-and-language transformer without convolution or region supervision,” in International Conference on Machine Learning.   PMLR, 2021, pp. 5583–5594.
  • [29] W. Wang et al., “Image as a foreign language: Beit pretraining for all vision and vision-language tasks,” arXiv preprint arXiv:2208.10442, 2022.
  • [30] K. He et al., “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
  • [31] A. Vaswani et al., “Attention is all you need,” Advances in Neural Information Processing Systems, vol. 30, 2017.
  • [32] M. Yu, C. Zhang, Y. Xu, S. Zhang, and D. Yu, “Metricnet: Towards improved modeling for non-intrusive speech quality assessment,” arXiv preprint arXiv:2104.01227, 2021.
  • [33] C. D. Kim, B. Kim, H. Lee, and G. Kim, “Audiocaps: Generating captions for audios in the wild,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019, pp. 119–132.
  • [34] R. Scheibler, E. Bezzam, and I. Dokmanić, “Pyroomacoustics: A python package for audio room simulation and array processing algorithms,” in International Conference on Acoustics, Speech and Signal Processing.   IEEE, 2018, pp. 351–355.
  • [35] A. Politis, S. Adavanne, D. Krause, A. Deleforge, P. Srivastava, and T. Virtanen, “A dataset of dynamic reverberant sound scenes with directional interferers for sound event localization and detection,” arXiv preprint arXiv:2106.06999, 2021.
  • [36] H. W. Chung et al., “Scaling instruction-finetuned language models,” arXiv preprint arXiv:2210.11416, 2022.
  • [37] J. Devlin et al., “Bert: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.