Sound Event Detection and Localization with Distance Estimation
thanks: The authors wish to thank CSC-IT Centre of Science Ltd., Finland, for providing computational resources.

Daniel Aleksander Krause, Archontis Politis, Annamaria Mesaros Faculty of Information Technology and Communication Sciences
Tampere University
Tampere, Finland
Abstract

Sound Event Detection and Localization (SELD) is a combined task of identifying sound events and their corresponding direction-of-arrival (DOA). While this task has numerous applications and has been extensively researched in recent years, it fails to provide full information about the sound source position. In this paper, we overcome this problem by extending the task to Sound Event Detection and Localization with Distance Estimation (3D SELD). We study two ways of integrating distance estimation within the SELD core: a multi-task approach, in which the problem is tackled by a separate model output, and a single-task approach obtained by extending the multi-ACCDOA method to include distance information. We investigate both methods for the Ambisonic and binaural versions of STARSS23: Sony-TAU Realistic Spatial Soundscapes 2023. Moreover, our study involves experiments on the loss function related to the distance estimation part. Our results show that it is possible to perform 3D SELD without any degradation of performance in sound event detection and DOA estimation.

Index Terms:
Sound event detection, sound source localization, sound distance estimation, Ambisonics, binaural recordings

I Introduction

Computational Auditory Scene Analysis (CASA) has emerged as a prominent area of study in recent years [1]. The automated examination of audio content holds substantial potential for diverse practical applications, including speech recognition [2], autonomous robots [3], surveillance systems [4], and support systems for the hearing-impaired [5]. CASA encompasses a spectrum of audio-related tasks, ranging from acoustic scene classification [6] and audio tagging [7] to sound source localization [8] and sound event detection [9]. Most tasks are currently approached by Deep Neural Network (DNN) models.

Although existing research has predominantly focused on individual tasks, a natural progression towards complex scene analysis systems involves the creation of models capable of simultaneously addressing multiple objectives. This approach has been observed in recent research, as exemplified by the exploration of joint sound event localization and detection (SELD) [10]. SELD combines the task of identifying the temporal activities of sound events together with their direction of arrival (DOA) and textual label. However, this approach does not take advantage of full spatial information, since it is limited to the DOA only. In many cases, performing Sound Distance Estimation (SDE) would also be important to obtain the explicit position of the sound source in space.

Research on DNN-based SELD has shown multiple ways of solving the problem of matching two tasks for different scenarios. In [10], the authors proposed a Convolutional Recurrent Neural Network (CRNN) with a two branch solution, in which SED and DOA estimation are solved with independent classwise outputs. To allow for detection of multiple events at the same time, a track-wise output has been proposed in [11, 12]. The activity-coupled Cartesian DOA (ACCDOA) scales the Cartesian DOA vector by its corresponding event activity, overcoming the need for a multi-task approach [13]. Finally, the multi-ACCDOA method takes advantage of the trackwise approach and the ACCDOA format, allowing for independent detection of same-class events [14].

Regarding SDE, studies on DNN-based approaches have been limited mostly to the binaural format. Most of them use a classification approach, in which the distance is expressed as a finite set of pre-defined distances in the close area up to 4 meters [15, 16]. In [17], the authors studied multiple loss functions to perform distance estimation with an activity detection branch for a tetrahedral microphone array. A few studies have investigated the performance of distance estimation in conjunction with DOA estimation [18, 19, 20]; however, no attempt has been made to merge SDE with SELD.

In this paper, we investigate the joint task of Sound Event Detection, Localization and Distance Estimation. We study two ways of performing all three tasks jointly. First, we examine a multi-task approach, in which two separate branches are responsible for SELD and SDE. Second, we propose the multi activity-coupled Cartesian Distance and DOA (multi-ACCDDOA) method, which is an extension of the known multi-ACCDOA format that includes the distance in the estimated vector. For both approaches, we study the influence of several loss functions to determine which is the most suitable for the joint task. Experiments are conducted for both First Order Ambisonics (FOA) and binaural recordings to investigate the potential performance of the task in a more limited audio format. To the authors’ knowledge, this is the first study investigating joint modelling of all three tasks.

II Method

II-A Features

TABLE I: Input parameters for the models.

| Audio data format | CH | T | F | P |
|---|---|---|---|---|
| Ambisonics | 7 | 250 | 64 | [4, 4, 2] |
| Binaural | 4 | 250 | 512 | [8, 8, 4] |

A feature input matrix of shape $CH \times T \times F$ is fed to the model, where $CH$, $T$ and $F$ stand for the number of channels, time sequence length in frames and number of features, respectively. The set of features utilized to train the models depends on the audio format under investigation, as summarized in Table I. In this paper, we study the Ambisonic and binaural formats. Each file is split into clips of length $T = 250$. A complex spectrogram of the signal is obtained using a Short-Time Fourier Transform (STFT) with a Hamming window of length 40 ms and 50% overlap. This results in $F = 512$ frequency bins.

For Ambisonics, we accumulate the magnitude spectrograms from 4 channels into $F = 64$ mel energies. In order to explore spatial cues, we extract 3 intensity vector matrices as in [21], accumulated along the mel frequencies to fit the required number of features. The overall input matrix sums up to $CH = 7$ feature channels.
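To illustrate the spatial cue mentioned above, the following is a minimal sketch of an intensity vector for a single time-frequency bin. It assumes the commonly used definition $I = \mathrm{Re}\{W^{*} [X, Y, Z]\}$ with unit-length normalization; the exact formulation and normalization of [21], the channel ordering, and the sign convention are not specified here, so the function name `intensity_vector` and its details are illustrative assumptions rather than the authors' feature extractor.

```python
import math

def intensity_vector(w, x, y, z, eps=1e-12):
    """Unit-norm intensity direction for one time-frequency bin.

    w, x, y, z are complex STFT values of the four FOA channels.
    Assumption: I = Re(conj(W) * [X, Y, Z]), scaled to unit length.
    """
    wc = w.conjugate()
    ix, iy, iz = (wc * x).real, (wc * y).real, (wc * z).real
    norm = math.sqrt(ix * ix + iy * iy + iz * iz) + eps
    return (ix / norm, iy / norm, iz / norm)

# Toy plane wave: the dipole channels are the pressure scaled by the
# direction cosines of a source on the positive y axis (hypothetical values).
pressure = complex(0.3, -0.7)
direction = (0.0, 1.0, 0.0)
w = pressure
x, y, z = (pressure * d for d in direction)
iv = intensity_vector(w, x, y, z)
```

In this toy case the three components of `iv` recover the direction cosines of the source, which is the information the three additional feature channels are meant to carry.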

For the binaural format, we extract the mean magnitude spectrogram from both binaural channels. To represent spatial information about the signal, we extract sines and cosines of Interaural Phase Differences (IPD), which provide a smooth representation of phase values and avoid phase wrapping. This feature has been shown to perform successfully in binaural DOA estimation and SDE [20, 22]. On top of that, we use Interaural Level Differences (ILDs), which constitute another major binaural cue that becomes important above 1.5 kHz. This set of features results in $CH = 4$ feature channels that are fed to the model.
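The benefit of the sine and cosine representation can be seen in a small sketch for a single time-frequency bin. The function name `binaural_cues` and the ILD definition in decibels are assumptions made for illustration; the text does not state how the ILD is scaled.

```python
import cmath
import math

def binaural_cues(left, right, eps=1e-12):
    """Return (sin IPD, cos IPD, ILD in dB) for one time-frequency bin."""
    ipd = cmath.phase(left * right.conjugate())  # wrapped to [-pi, pi]
    ild_db = 20.0 * math.log10((abs(left) + eps) / (abs(right) + eps))
    return math.sin(ipd), math.cos(ipd), ild_db

# Two phase differences just either side of the +/-pi wrap-around point.
a = binaural_cues(cmath.rect(1.0, math.pi - 0.01), cmath.rect(0.5, 0.0))
b = binaural_cues(cmath.rect(1.0, -math.pi + 0.01), cmath.rect(0.5, 0.0))
```

The raw IPD values of `a` and `b` differ by almost $2\pi$, although the two bins are physically nearly identical. Their sine and cosine values differ only slightly, which is why these features avoid the discontinuity caused by phase wrapping.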

II-B Model

Figure 1: Architecture of the deep neural network.

We employ a convolutional recurrent neural network (CRNN), a model type that is common for SELD. To perform 3D SELD, we modify the model outputs to include the distance estimate. The architecture of the utilized model is depicted in Fig. 1. First, the feature input matrix is processed by three 2D convolutional blocks, each consisting of 128 filter kernels, batch normalization and max-pooling across the feature dimension. Additionally, the first layer involves pooling across the time dimension with a rate of 5. The pooling rate across the feature dimension depends on the utilized feature set and is meant to reduce the feature dimension to 4 in the last convolutional block. Hence, for Ambisonics the rates are equal to $P = [4, 4, 2]$, whereas for the binaural format they are $P = [8, 8, 4]$. Next, the feature maps are passed to two bi-directional gated recurrent units (GRUs) and two Multi-Head Attention layers with 8 heads each.
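The temporal resolution implied by these settings can be worked out with simple arithmetic. The sketch below uses only the window length, overlap, clip length and time pooling rate stated in the text; it ignores any padding at the clip edges, so the clip duration is approximate, and the variable names are illustrative.

```python
WINDOW_MS = 40.0   # STFT window length
OVERLAP = 0.5      # 50% overlap between consecutive windows
T_FRAMES = 250     # input frames per clip
TIME_POOL = 5      # time pooling rate in the first convolutional block

hop_ms = WINDOW_MS * (1.0 - OVERLAP)           # step between STFT frames
clip_seconds = T_FRAMES * hop_ms / 1000.0      # approximate clip duration
output_frames = T_FRAMES // TIME_POOL          # frames after time pooling
output_resolution_ms = hop_ms * TIME_POOL      # duration of one output frame
```

Under these assumptions, each 250-frame clip covers about 5 s of audio and the model emits 50 output frames per clip, one every 100 ms.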

The output of the model consists of $Q$ branches, where $Q$ depends on the number of performed tasks. Each branch is composed of two fully connected (FC) layers, the former containing 128 neurons, and the latter outputting $O_q$ values. The parameters of the last layers are determined by the utilized method of including SDE within the 3D SELD framework. Here, we propose two techniques:

I. Multi-ACCDDOA: we modify the single-task multi-ACCDOA approach proposed in [14]. Compared with the former, we extend the 3-element DOA vector to include the distance estimate as well. For $N$ tracks, $C$ classes and $T$ frames, we define the output as $y_{nct} = [a_{nct} R_{nct}, D_{nct}]$, where $n, c, t$ indicate the output track number, target class and time frame, $a_{nct} \in \{0, 1\}$ stands for the detection activity, $R_{nct} \in \langle -1, 1 \rangle$ are the DOA vectors and $D_{nct} \in \langle 0, \infty)$ corresponds to distance values. The dimensions hold the following characteristics: $\mathbf{a}, \mathbf{D} \in \mathbb{R}^{N \times C \times T}$, $\mathbf{R} \in \mathbb{R}^{3 \times N \times C \times T}$, and $\|\mathbf{R}_{nct}\| = 1$. We model up to $N = 3$ tracks; with four values per track and class and $C = 13$ classes, this gives $O_1 = 156$.
The whole output uses a linear activation in order to cover the value ranges of both the DOA and the distance components. The multi-ACCDDOA model is trained using Auxiliary Duplicating Permutation Invariant Training (ADPIT) as in [14]. The final loss function is defined as:

$$\mathcal{L}^{ADPIT} = \frac{1}{CT} \sum_{c}^{C} \sum_{t}^{T} \min_{\alpha \in Perm[ct]} l_{\alpha,ct}^{ACCDDOA}, \tag{1}$$

$$l_{\alpha,ct}^{ACCDDOA} = \frac{1}{N} \sum_{n}^{N} \mathcal{L}(y_{\alpha,nct}, \hat{y}_{\alpha,nct}), \tag{2}$$

where $\mathcal{L}(\cdot)$ is a loss function of choice, $\alpha$ is one possible track permutation and $Perm[ct]$ is the set of all possible permutations.
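The structure of Eqs. (1) and (2) can be sketched for a single class and frame as follows. This is a simplified illustration: it takes the minimum over plain track permutations and does not implement the auxiliary duplication step of ADPIT described in [14]. The helper names (`accddoa_vector`, `pit_loss`) and the choice of a zero distance for inactive tracks are assumptions made for the example.

```python
from itertools import permutations

def accddoa_vector(activity, doa, distance):
    """y = [a * R, D]: activity-scaled DOA vector plus the distance."""
    return [activity * r for r in doa] + [distance]

def mse(target, pred):
    return sum((t - p) ** 2 for t, p in zip(target, pred)) / len(target)

def pit_loss(targets, preds, loss=mse):
    """Minimum over track permutations of the track-averaged loss."""
    n = len(targets)
    best = float("inf")
    for perm in permutations(range(n)):
        value = sum(loss(targets[k], preds[i]) for i, k in enumerate(perm)) / n
        best = min(best, value)
    return best

# N = 3 tracks for one class and frame: two active events and one empty track.
targets = [
    accddoa_vector(1, (1.0, 0.0, 0.0), 2.0),
    accddoa_vector(1, (0.0, 1.0, 0.0), 3.5),
    accddoa_vector(0, (0.0, 0.0, 0.0), 0.0),
]
# Predictions carry the same content, but on differently ordered tracks.
preds = [targets[2], targets[0], targets[1]]

fixed_order_loss = sum(mse(t, p) for t, p in zip(targets, preds)) / 3
invariant_loss = pit_loss(targets, preds)
```

With a fixed track order the shuffled predictions are penalized, whereas the permutation-invariant loss is zero, since some ordering of the tracks matches the targets exactly.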

II. Multi-task (MT): here, the output is split into $Q = 2$ branches. The first branch performs SELD using the classwise ACCDOA approach as described in [13]. The output is defined as $y_{1,ct} = a_{ct} R_{ct}$. Since the models estimate the DOA vector as $x, y, z$ Cartesian coordinates for each class and $C = 13$, this results in $O_1 = 39$. The second branch is responsible for classwise SDE, where $y_{2,ct} = D_{ct}$ and $O_2 = 13$. The ACCDOA output is normalized with a tanh activation, whereas the distance branch uses a Rectified Linear Unit. The matrix dimensions are defined as follows: $\mathbf{a}, \mathbf{D} \in \mathbb{R}^{C \times T}$, $\mathbf{R} \in \mathbb{R}^{3 \times C \times T}$. The final loss is a sum of both branches:

$$\mathcal{L}^{MT} = \frac{1}{CT} \sum_{c}^{C} \sum_{t}^{T} \left( \mathcal{L}_1(y_{1,ct}, \hat{y}_{1,ct}) + \mathcal{L}_2(y_{2,ct}, \hat{y}_{2,ct}) \right), \tag{3}$$

where $\mathcal{L}_1(\cdot)$ and $\mathcal{L}_2(\cdot)$ are the losses of the two branches.
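As a rough sketch of Eq. (3) for a single frame, the two branch losses can simply be added. The example below flattens the ACCDOA values of $C = 2$ hypothetical classes into one list, so the averaging differs from the $1/(CT)$ normalization of Eq. (3) by constant factors; the function name `multitask_loss` and all numeric values are illustrative assumptions.

```python
def mse(target, pred):
    return sum((t - p) ** 2 for t, p in zip(target, pred)) / len(target)

def mae(target, pred):
    return sum(abs(t - p) for t, p in zip(target, pred)) / len(target)

def multitask_loss(accdoa_true, accdoa_pred, dist_true, dist_pred, dist_loss=mae):
    """MSE on the ACCDOA branch plus a selectable loss on the distance branch."""
    return mse(accdoa_true, accdoa_pred) + dist_loss(dist_true, dist_pred)

# Two classes: the first is active on the x axis at 2 m, the second is silent.
accdoa_true = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
accdoa_pred = [0.8, 0.0, 0.0, 0.0, 0.0, 0.0]
dist_true = [2.0, 0.0]
dist_pred = [2.5, 0.0]

total = multitask_loss(accdoa_true, accdoa_pred, dist_true, dist_pred)
```

Passing a different function as `dist_loss` changes only the distance term, which mirrors the experimental setup in which the ACCDOA branch keeps the MSE loss throughout.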

We note that the output of the MT approach does not allow for differentiating between overlapping sources of the same class, whereas the multi-ACCDDOA approach overcomes this problem. The output parameters are summarized in Table II. Models are implemented in PyTorch [23] and trained using the Adam optimizer for 250 epochs with a patience of 75 epochs.

TABLE II: Output parameters for different models.

| Method | $Q$ | $O_q$ | Output activation |
|---|---|---|---|
| Multi-task | 2 | [39, 13] | [tanh, ReLU] |
| Multi-ACCDDOA | 1 | 156 | linear |
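The output sizes in Table II follow directly from the number of tracks, classes and values per event. The sketch below reproduces them and shows one possible way of unpacking a flat multi-ACCDDOA frame; the memory layout used by `split_multi_accddoa` (track-major, then class, then the four values) is a hypothetical choice for illustration and is not specified in the text.

```python
N_TRACKS = 3
N_CLASSES = 13

multi_accddoa_outputs = 4 * N_TRACKS * N_CLASSES  # (x, y, z, distance) per track and class
mt_accdoa_outputs = 3 * N_CLASSES                 # (x, y, z) per class
mt_distance_outputs = N_CLASSES                   # one distance per class

def split_multi_accddoa(flat, n_tracks=N_TRACKS, n_classes=N_CLASSES):
    """Reshape one flat output frame to [track][class] -> (x, y, z, d)."""
    it = iter(flat)
    return [[tuple(next(it) for _ in range(4)) for _ in range(n_classes)]
            for _ in range(n_tracks)]

frame = split_multi_accddoa(list(range(multi_accddoa_outputs)))
```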

II-C Loss functions

In recent works on SELD using ACCDOA, the mean squared error (MSE) has been established as the most common loss function. However, the inclusion of the distance estimation part introduces a different value range, for which other loss functions might be more appropriate. The standard MSE function prioritizes sound sources which are further away from the origin, since large distances create a more significant error. In [17] the authors introduced a relative regressor function, which evens out the error across ground truth distances, therefore penalizing the SDE branch more fairly. Here, we investigate the proposed loss functions for a joint 3D SELD model. The investigated functions include:

  • Mean Squared Error:

    $$MSE = \frac{1}{M} \sum_{m=0}^{M-1} (y[m] - \hat{y}[m])^{2},$$

  • Mean Absolute Error:

    $$MAE = \frac{1}{M} \sum_{m=0}^{M-1} |y[m] - \hat{y}[m]|,$$

  • Mean Square Percent Error:

    $$MSPE = \frac{1}{M} \sum_{m=0}^{M-1} \left( \frac{y[m] - \hat{y}[m]}{\hat{y}[m]} \right)^{2},$$

  • Mean Absolute Percent Error:

    $$MAPE = \frac{1}{M} \sum_{m=0}^{M-1} \left| \frac{y[m] - \hat{y}[m]}{\hat{y}[m]} \right|,$$

where $M$ stands for the number of estimated values, and $y$ and $\hat{y}$ correspond to the predicted and ground truth values, respectively. For the MT model, we investigate all aforementioned loss functions for the SDE branch ($\mathcal{L}_2$). Since scaling the DOA vectors with distance values would put a larger weight on close sources, the ACCDOA branch ($\mathcal{L}_1$) keeps the MSE loss function for all scenarios. For similar reasons, the multi-ACCDDOA approach ($\mathcal{L}$) is tested only for the MSE and MAE losses.
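The four loss functions and the effect of the relative formulation can be illustrated with a short sketch. Following the definitions above, the first argument is the prediction $y$ and the second is the ground truth $\hat{y}$, which appears in the denominator of the percent errors. The sketch assumes strictly positive ground truth distances; how frames without an active source are handled is not covered here.

```python
def mse(pred, true):
    return sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true)

def mae(pred, true):
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(true)

def mspe(pred, true):
    return sum(((p - t) / t) ** 2 for p, t in zip(pred, true)) / len(true)

def mape(pred, true):
    return sum(abs((p - t) / t) for p, t in zip(pred, true)) / len(true)

# The same 10% over-estimate for a near (1 m) and a far (5 m) source.
near = ([1.1], [1.0])
far = ([5.5], [5.0])

mse_ratio = mse(*far) / mse(*near)     # how much more the far source is penalized
mape_gap = mape(*far) - mape(*near)    # difference under the relative loss
```

For this toy case the MSE penalizes the far source roughly 25 times more than the near one, while the MAPE assigns both the same error. This is the sense in which the relative losses even out the error across ground truth distances.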

III Experiments

III-A Data

For experiments, we use the Ambisonic audio-only version of the STARSS23 dataset. The dataset includes 7 hours and 22 minutes of real recordings, which are split into training data (90 clips) and testing data (78 clips). There are 13 sound event classes present, including female speech, male speech, clapping, telephone, laughter, domestic sounds, footsteps, door, music, musical instrument, water tap, bell and knock. The scenarios include up to 3 overlapping sound sources. For more details, refer to the original paper [24]. In order to increase the amount of training data, we synthesized more mixtures using the data generator described in [25]. The additional data amounts to 1200 one-minute mixtures with the same maximum polyphony level as the original dataset. The data was synthesized using sounds from FSD50k [26]. In order to provide experiments for the binaural format, we convert the whole dataset from Ambisonic to binaural using the spaudiopy library [27]. Decoding is performed via magnitude least-squares matching to a measured set of Head Related Transfer Functions as described in [28].

III-B Evaluation metrics

To evaluate our models, we use the SELD metrics from the DCASE Challenge 2023 Task 3. These include the Error Rate (ER) and $F_1$ score for SED, and the DOA error and the localization recall for localization. The detection metrics are location-aware, i.e., the events are counted only if the assigned source falls within ±20° of the ground truth DOA, whereas localization metrics are calculated only for true positives. On top of the well-established SELD metrics, we add the distance error, which is defined as the mean absolute error between ground truth and predicted distances. The metrics are calculated in one-second segments using micro-averaging, and the matching between ground truth and predictions is done via the Hungarian algorithm based on the angular distance between sources. For more details, see [29].
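The matching and error computation can be sketched for a single frame as follows. This is an illustration only, not the DCASE evaluation code: it replaces the Hungarian algorithm with a brute-force search over permutations, which yields the same optimal assignment for a handful of sources, and it assumes equal numbers of reference and estimated sources. All names and numeric values are hypothetical.

```python
import math
from itertools import permutations

def angular_error_deg(u, v):
    """Angle in degrees between two DOA vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / norm))))

def match_by_angle(refs, ests):
    """Optimal one-to-one assignment minimizing the summed angular error."""
    best_cost, best_perm = float("inf"), None
    for perm in permutations(range(len(ests))):
        cost = sum(angular_error_deg(refs[i][0], ests[j][0])
                   for i, j in enumerate(perm))
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    return list(enumerate(best_perm))

# Each entry is (DOA vector, distance in metres).
refs = [((1.0, 0.0, 0.0), 2.0), ((0.0, 1.0, 0.0), 4.0)]
ests = [((0.0, 0.98, 0.2), 3.4), ((0.97, 0.24, 0.0), 2.3)]

pairs = match_by_angle(refs, ests)
doa_errors = [angular_error_deg(refs[i][0], ests[j][0]) for i, j in pairs]
within_threshold = [e <= 20.0 for e in doa_errors]
distance_error = sum(abs(refs[i][1] - ests[j][1]) for i, j in pairs) / len(pairs)
```

Here the matching is driven by the angular distance alone, and the distance error is computed afterwards on the matched pairs, in line with the description above.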

III-C Results

TABLE III: Results obtained for Ambisonics. Values in brackets are confidence intervals. For Multi-ACCDDOA, a single loss is applied to the whole output vector.

| Method | SELD loss | Dist. loss | ER | $F_1$ [%] | DOA error [°] | Recall [%] | Dist. error [m] |
|---|---|---|---|---|---|---|---|
| Multi-task | MSE | MSE | 0.63 [0.59, 0.67] | 41.4 [36.58, 47.04] | 22.5 [19.18, 25.58] | 61.0 [57.46, 65.57] | 0.95 [0.82, 1.05] |
| Multi-task | MSE | MAE | 0.64 [0.60, 0.68] | 43.6 [39.18, 48.60] | 21.6 [18.63, 24.36] | 41.1 [37.36, 45.64] | 0.93 [0.80, 1.02] |
| Multi-task | MSE | MSPE | 0.63 [0.59, 0.68] | 44.1 [38.92, 48.72] | 23.2 [18.31, 28.13] | 64.7 [61.64, 68.68] | 0.89 [0.77, 0.99] |
| Multi-task | MSE | MAPE | 0.65 [0.61, 0.68] | 43.5 [38.97, 47.80] | 22.0 [18.94, 24.80] | 64.5 [61.18, 68.72] | 0.88 [0.75, 0.97] |
| Multi-ACCDDOA | MSE | MSE | 0.65 [0.61, 0.70] | 44.2 [39.45, 48.65] | 22.9 [19.33, 26.46] | 68.4 [65.15, 72.33] | 0.92 [0.80, 1.01] |
| Multi-ACCDDOA | MAE | MAE | 0.86 [0.82, 0.91] | 21.5 [13.98, 28.47] | 17.7 [14.09, 21.05] | 19.1 [12.44, 24.90] | 0.74 [0.54, 0.93] |

TABLE IV: Results obtained for binaural audio. Values in brackets are confidence intervals. For Multi-ACCDDOA, a single loss is applied to the whole output vector.

| Method | SELD loss | Dist. loss | ER | $F_1$ [%] | DOA error [°] | Recall [%] | Dist. error [m] |
|---|---|---|---|---|---|---|---|
| Multi-task | MSE | MSE | 0.82 [0.79, 0.86] | 20.0 [15.40, 24.40] | 41.1 [34.63, 47.77] | 45.6 [41.89, 49.02] | 1.02 [0.90, 1.12] |
| Multi-task | MSE | MAE | 0.85 [0.81, 0.87] | 16.5 [13.91, 20.29] | 38.6 [32.40, 43.28] | 36.7 [33.61, 40.89] | 1.04 [0.90, 1.15] |
| Multi-task | MSE | MSPE | 0.85 [0.81, 0.89] | 19.3 [15.70, 23.91] | 38.9 [31.91, 44.28] | 38.9 [35.52, 43.56] | 1.01 [0.87, 1.12] |
| Multi-task | MSE | MAPE | 0.87 [0.84, 0.91] | 18.5 [15.16, 22.23] | 38.1 [32.77, 42.45] | 42.2 [38.59, 45.83] | 0.98 [0.86, 1.09] |
| Multi-ACCDDOA | MSE | MSE | 0.87 [0.82, 0.91] | 21.1 [17.38, 25.26] | 39.7 [32.58, 46.34] | 48.0 [44.75, 51.25] | 0.99 [0.88, 1.09] |
| Multi-ACCDDOA | MAE | MAE | 0.97 [0.94, 0.99] | 5.4 [2.46, 8.24] | 44.5 [36.84, 51.93] | 16.3 [10.12, 21.72] | 0.75 [0.59, 0.90] |

Tables III and IV show the results obtained for the Ambisonic and binaural datasets, respectively. The results are reported for the testing set using jackknife estimation to obtain confidence intervals with significance level of 0.05.
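A generic jackknife confidence interval can be sketched as below. This is an illustrative assumption about the procedure, not the evaluation code used for the tables: it leaves out one item at a time (for instance one recording), uses the standard jackknife standard error, and takes a normal quantile for the interval, whereas a Student-t quantile may be used in practice. The per-item values in `errors` are hypothetical.

```python
import math
from statistics import NormalDist

def jackknife_ci(values, statistic, alpha=0.05):
    """Return (estimate, standard error, (low, high)) via leave-one-out resampling."""
    n = len(values)
    full = statistic(values)
    loo = [statistic(values[:i] + values[i + 1:]) for i in range(n)]
    loo_mean = sum(loo) / n
    se = math.sqrt((n - 1) / n * sum((x - loo_mean) ** 2 for x in loo))
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return full, se, (full - z * se, full + z * se)

def mean(values):
    return sum(values) / len(values)

# Hypothetical per-recording distance errors in metres.
errors = [0.8, 1.1, 0.9, 1.3, 0.7, 1.0]
estimate, se, (low, high) = jackknife_ci(errors, mean)
```

For the mean, the jackknife standard error coincides with the usual standard error of the mean; the value of the method lies in handling metrics such as the $F_1$ score, which are not simple averages over recordings.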

As can be seen from most metrics, the performance of the models is significantly worse for the binaural dataset than for the Ambisonic dataset. For the multi-task approach, the error rate went up from the range of [0.63, 0.65] to [0.82, 0.87], whereas the $F_1$ score decreased from over 40% to the range of [16.5, 20.0]%. A similarly large effect can be seen for the localization metrics, where the DOA error ranges from 38.1° to 41.1° and the recall from 36.7% to 45.6%. This effect is partly expected. Binaural recordings consist of only two channels as compared with four channels for FOA. Moreover, the cone of confusion effect and high directivity of the ears might impact the performance significantly. These problems can be largely overcome by utilizing a moving receiver as shown in [20, 22]. Interestingly, distance estimation has been affected to a much lesser extent than other tasks. For the MT approach, the distance error stays in the range between 0.98 m and 1.04 m, which is roughly a 10% increase from the Ambisonic results. These results indicate that efficient distance estimation can be achieved even with binaural recordings, which shows a potential for future research.

As can be seen for both audio formats, for the multi-task approach most differences between loss functions appear for the distance error. This is expected given that the ACCDOA branch uses the same loss for all scenarios. For MSE and MAE, the Ambisonic model achieves a similar distance error of 0.95 m and 0.93 m, respectively. This is a reasonable performance compared with other metrics. Introducing the distance scaling brings the error further down, to 0.89 m for MSPE and 0.88 m for MAPE. Similarly, the binaural model achieves the lowest distance error of 0.98 m for MAPE. The SELD metrics show very limited influence of the distance loss on the training of the rest of the multi-task model. For FOA data, all scenarios achieve a fairly similar error rate between 0.63 and 0.65 and an $F_1$ score between 41.4% for MSE and 44.1% for MSPE. Interestingly, a higher impact can be seen for the localization metrics. Despite the DOA error staying between 21.6° and 23.2°, the recall varies between different scenarios. Compared with 61.0% for MSE, the MAE loss results in a recall of 41.1%, showing a decreased efficiency in estimating the correct number of sources. Also for binaural data, the MAE loss decreases the recall from 45.6% to 36.7%. However, for FOA the distance-scaled losses visibly improve the localization recall to 64.7% and 64.5% for MSPE and MAPE, respectively.

Larger differences between loss functions can be observed for the multi-ACCDDOA approach. The single-task model trained with MSE achieves an ER and a DOA error which are comparable with the multi-task approach. However, amongst all results obtained for the Ambisonic format, this model achieves the best $F_1$ score of 44.2% and recall of 68.4%. The distance error of 0.92 m is similar to the one achieved for the multi-task approach with MSE. Similarly, the binaural model obtains the best $F_1$ score of 21.1% and recall of 48.0% using this method.

Unsurprisingly, changing the loss function to MAE affects all tasks to a much larger extent when using this method. First, the Ambisonic model achieves a distance error of 0.74 m, which is the best result across all experiments, suggesting that the absolute error is a better fit for estimating the distance. The loss positively affects the DOA error as well, for which the value goes down to 17.7°. However, detection metrics show a significant decrease in performance of the SED part: ER increases to 0.86, whereas the $F_1$ score goes down to 21.5%. Similarly, recall goes down to 19.1%, which is the worst result for the Ambisonic format. For binaural data, the $F_1$ score drops to 5.4%, which is an unacceptable performance. Such large differences between MSE and MAE show a discrepancy between the SDE and SELD parts, for which different loss functions are optimal.

Across both datasets, the best SELD performance is achieved for the multi-ACCDDOA approach using the MSE loss function. Given that this method also allows for the detection of multiple overlapping events of the same class, we consider this format superior to the multi-task approach. However, the best distance error is achieved when using the MAE loss. For future studies, we propose to investigate a mixed loss function, which would combine the benefits of using MSE for SELD and MAE for distance estimation. Alternatively, a different task definition might be used, combining the track-wise approach of multi-ACCDDOA with a multi-task output representation.
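One conceivable form of such a mixed loss is sketched below for a single multi-ACCDDOA output vector. It is a hypothetical illustration of the idea proposed for future work and has not been evaluated in this study; the function name `mixed_loss` and the balancing factor `distance_weight` are assumptions.

```python
def mixed_loss(target, pred, distance_weight=1.0):
    """MSE on the activity-scaled DOA part, MAE on the distance element.

    target and pred are 4-element vectors [a*x, a*y, a*z, d].
    distance_weight is a hypothetical factor balancing the two terms.
    """
    doa_term = sum((t - p) ** 2 for t, p in zip(target[:3], pred[:3])) / 3.0
    dist_term = abs(target[3] - pred[3])
    return doa_term + distance_weight * dist_term

target = [1.0, 0.0, 0.0, 2.0]
pred = [0.7, 0.0, 0.0, 2.5]
value = mixed_loss(target, pred)
```

Such a loss could be substituted for $\mathcal{L}(\cdot)$ in Eq. (2), so that the permutation-invariant training scheme itself remains unchanged.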

IV Conclusions

In this study, we investigate the joint task of Sound Event Detection, Localization and Distance Estimation. We propose two methods of solving this problem: a multi-task classwise approach and the multi-ACCDDOA method. The methods are investigated using several loss functions for an Ambisonic and a binaural dataset. Our experiments show the best results when using the multi-ACCDDOA approach with the MSE loss function. However, we note a disparity between SELD and distance estimation, where the latter performs better when using an MAE loss. Future studies could propose a mixed loss function to further improve the results.

References

  • [1] T. Virtanen, M. D. Plumbley, and D. Ellis, Computational Analysis of Sound Scenes and Events.   Springer, 2018.
  • [2] M. Woelfel and J. McDonough, Distant speech recognition.   Wiley, 2009.
  • [3] J. Hornstein, M. Lopes, J. Santos-Victor, and F. Lacerda, “Sound localization for humanoid robots - building audio-motor maps based on the hrtf,” in 2006 IEEE/RSJ International Conference on Intelligent Robots and Systems, 2006, pp. 1170–1176.
  • [4] K. Łopatka, J. Kotus, and A. Czyżewski, “Application of vector sensors to acoustic surveillance of a public interior space,” Archives of Acoustics, vol. 36, pp. 851–860, 2011.
  • [5] Y.-T. Peng, C.-Y. Lin, M.-T. Sun, and K.-C. Tsai, “Healthcare audio event classification using hidden markov models and hierarchical hidden markov models,” in 2009 IEEE International Conference on Multimedia and Expo, 2009, pp. 1218–1221.
  • [6] A. Mesaros, T. Heittola, and T. Virtanen, “Acoustic scene classification: An overview of DCASE 2017 challenge entries,” in 16th International Workshop on Acoustic Signal Enhancement (IWAENC 2018), 2018.
  • [7] E. Fonseca, M. Plakal, F. Font, D. P. W. Ellis, X. Favory, J. Pons, and X. Serra, “General-purpose tagging of freesound audio with audioset labels: Task description, dataset, and baseline,” in Proc. of the Detection and Classification of Acoustic Scenes and Events 2018 Workshop (DCASE2018), 2018, pp. 69–73.
  • [8] D. Krause, A. Politis, and K. Kowalczyk, “Comparison of convolution types in CNN-based feature extraction for sound source localization,” in 28th European Signal Processing Conference (EUSIPCO 2020), 2020, pp. 820–824.
  • [9] A. Mesaros, A. Diment, B. Elizalde, T. Heittola, E. Vincent, B. Raj, and T. Virtanen, “Sound event detection in the DCASE 2017 Challenge,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 6, pp. 992–1006, 2019.
  • [10] S. Adavanne, A. Politis, J. Nikunen, and T. Virtanen, “Sound event localization and detection of overlapping sources using convolutional recurrent neural networks,” IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 34–48, 2019.
  • [11] Y. Cao, T. Iqbal, Q. Kong, F. An, W. Wang, and M. D. Plumbley, “An improved event-independent network for polyphonic sound event localization and detection,” in ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021, pp. 885–889.
  • [12] T. N. Tho Nguyen, D. L. Jones, and W.-S. Gan, “A sequence matching network for polyphonic sound event localization and detection,” in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 71–75.
  • [13] K. Shimada, Y. Koyama, N. Takahashi, S. Takahashi, and Y. Mitsufuji, “ACCDOA: Activity-coupled Cartesian direction of arrival representation for sound event localization and detection,” in ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021, pp. 915–919.
  • [14] K. Shimada, Y. Koyama, S. Takahashi, N. Takahashi, E. Tsunoo, and Y. Mitsufuji, “Multi-ACCDOA: Localizing and detecting overlapping sounds from the same class with auxiliary duplicating permutation invariant training,” in ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022, pp. 316–320.
  • [15] M. Yiwere and E. J. Rhee, “Sound source distance estimation using deep learning: An image classification approach,” Sensors, vol. 20, no. 1, 2020. [Online]. Available: https://www.mdpi.com/1424-8220/20/1/172
  • [16] A. Sobhdel, R. Razavi-Far, and S. Shahrivari, “Few-shot sound source distance estimation using relation networks,” 2021. [Online]. Available: https://arxiv.org/abs/2109.10561
  • [17] S. S. Kushwaha, I. R. Román, M. Fuentes, and J. P. Bello, “Sound source distance estimation in diverse and dynamic acoustic conditions,” in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, WASPAA 2023. IEEE, pp. 1–5.
  • [18] M. Yiwere and E. J. Rhee, “Distance estimation and localization of sound sources in reverberant conditions using deep neural networks,” International Journal of Applied Engineering Research, 2017.
  • [19] D. A. Krause, A. Politis, and A. Mesaros, “Joint direction and proximity classification of overlapping sound events from binaural audio,” in 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA). IEEE, 2021, pp. 331–335.
  • [20] D. A. Krause, G. García-Barrios, A. Politis, and A. Mesaros, “Binaural sound source distance estimation and localization for a moving listener,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 32, pp. 996–1011, 2024.
  • [21] M. Yasuda, Y. Koizumi, S. Saito, H. Uematsu, and K. Imoto, “Sound event localization based on sound intensity vector refined by DNN-based denoising and source separation,” in 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 651–655.
  • [22] G. García-Barrios, D. A. Krause, A. Politis, A. Mesaros, J. M. Gutiérrez-Arriola, and R. Fraile, “Binaural source localization using deep learning and head rotation information,” in 2022 30th European Signal Processing Conference (EUSIPCO). IEEE, 2022, pp. 36–40.
  • [23] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Köpf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala, “PyTorch: An imperative style, high-performance deep learning library,” 2019. [Online]. Available: https://doi.org/10.48550/arXiv.1912.01703
  • [24] K. Shimada, A. Politis, P. Sudarsanam, D. Krause, K. Uchida, S. Adavanne, A. Hakala, Y. Koyama, N. Takahashi, S. Takahashi, T. Virtanen, and Y. Mitsufuji, “STARSS23: An audio-visual dataset of spatial recordings of real scenes with spatiotemporal annotations of sound events,” arXiv e-prints: 2306.09126, 2023. [Online]. Available: https://arxiv.org/abs/2306.09126
  • [25] A. Politis, S. Adavanne, D. Krause, A. Deleforge, P. Srivastava, and T. Virtanen, “A dataset of dynamic reverberant sound scenes with directional interferers for sound event localization and detection,” in Proceedings of the 6th Detection and Classification of Acoustic Scenes and Events 2021 Workshop (DCASE2021), pp. 125–129.
  • [26] E. Fonseca, X. Favory, J. Pons, F. Font, and X. Serra, “FSD50K: an open dataset of human-labeled sound events,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 30, pp. 829–852, 2022.
  • [27] C. Hold, “spaudiopy,” 2023. [Online]. Available: https://github.com/chris-hld/spaudiopy/tree/master
  • [28] F. Zotter and M. Frank, Ambisonics. Springer Cham, 2019.
  • [29] A. Politis, A. Mesaros, S. Adavanne, T. Heittola, and T. Virtanen, “Overview and evaluation of sound event localization and detection in DCASE 2019,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, pp. 1–14, 2020.