BERP: A Blind Estimator of Room Acoustic and Physical Parameters for Single-Channel Noisy Speech Signals

Lijun Wang¹, Yixian Lu¹, Ziyan Gao, Kai Li, Jianqiang Huang, Yuntao Kong, and Shogo Okada

L. Wang, Y. Lu, Z. Gao, K. Li, J. Huang, Y. Kong and S. Okada are with the School of Information Science, Japan Advanced Institute of Science and Technology, Ishikawa, Japan. email: {lijun.wang, ziyan-g, kai-li, jq.huang, okada-s, yuntao.kong}@jaist.ac.jp (Corresponding author: [email protected]). Y. Lu is with ACES, Inc., Tokyo, Japan. email: [email protected]. This work was partially supported by the Japan Society for the Promotion of Science (JSPS) KAKENHI (grant number 23H03506). ¹Equal contribution. The code and weights are available at https://github.com/Alizeded/BERP.

Abstract

Room acoustic parameters (RAPs) and room physical parameters (RPPs) are essential metrics for parameterizing the room acoustical characteristics (RACs) of the sound field in a listener’s local environment, offering comprehensive indications for various applications. Current RAP and RPP estimation methods either fall short of covering broad real-world acoustic environments with real background noise or lack a universal framework for blindly estimating RAPs and RPPs from noisy single-channel speech signals, particularly sound source distances, directions-of-arrival (DOAs) of sound sources, and occupancy levels. In this paper, we propose a novel universal blind estimation framework called the blind estimator of room acoustic and physical parameters (BERP). We introduce a new stochastic room impulse response (RIR) model, namely, the sparse stochastic impulse response (SSIR) model, and endow the BERP with a unified encoder and multiple separate predictors to estimate the RPPs and SSIR parameters in parallel. This framework enables computationally efficient and universal estimation of room parameters using only noisy single-channel speech signals. Finally, all the RAPs can be simultaneously derived from the RIRs synthesized by the SSIR model with the estimated parameters. To evaluate the effectiveness of the proposed BERP and SSIR models, we compile a task-specific dataset from several publicly available datasets. The results reveal that the BERP achieves state-of-the-art (SOTA) performance. Moreover, the evaluation results for the SSIR model also demonstrate its efficacy. The code is available on GitHub.

Index Terms:
Room acoustics, Room impulse response, Blind estimation, Reverberation time, Room acoustic parameters, Attention mechanism

I Introduction

Room acoustical characteristics (RACs) describe the acoustical properties through which people perceive sound in an enclosure. RACs determine how intelligibly and clearly people perceive sound in an auditory space enclosed by walls, ceilings, and furnishings. For instance, concert halls require clear and transparent sound for music appreciation, whereas lecture rooms require intelligible delivery of lectures and public addresses. General auditoriums require intelligible and easily audible sound [1]. Local RACs are widely employed in speech enhancement, hearing aids, immersive audio, context-aware rendering (such as mixed reality and augmented reality), public address systems, and robotic systems. The dynamic parameterization of local RACs poses a significant challenge in room acoustics, given the interference caused by background environmental noise.

While a room impulse response (RIR) can fully represent a listener’s local RACs, it does not provide a direct interpretation of how a listener perceives them, i.e., the subjective perception of the local RACs. Since speech intelligibility and sound clarity are subjective perceptions, listening experiments are typically conducted to assess them. However, conducting listening experiments is expensive and time-consuming, making them impractical to apply in public spaces [2]. Additionally, an RIR does not directly provide physical geometry-related information, such as room volumes, the distances of sound sources, and the corresponding orientations, which have critical applications in spatial audio rendering, intelligibility assessments in a room, sound source separation, audio navigation systems, and speech enhancement [3, 4, 5, 6, 7, 8, 9]. Consequently, room acoustic parameters (RAPs) and room physical parameters (RPPs) have been used to model local RACs to offer clear and comprehensive indications for various applications, such as room acoustical assessment [10, 11, 12, 7, 13], speech enhancement [14, 15, 9], hearing aids [16, 17, 18, 19, 6, 20], sound source separation [3, 4], spatial audio rendering [21], context-aware rendering in extended reality (XR) and augmented reality (AR) [22, 5, 23], public address systems [24, 25], and robotics [26].

A few RAPs have been investigated and standardized [27, 11, 10, 12]. In IEC 60268-16:2020, the speech transmission index (STI) is used to predict the speech intelligibility of an enclosure. The percentage articulation loss of consonants ($\%\mathrm{AL}_{\mathrm{cons}}$) [11] was studied to compensate for the limitations of the STI, which has difficulty reflecting the effect of linguistic information on the perception of intelligibility. The essential RAPs and their corresponding measurements, including reverberation time ($T_{60}$), early decay time (EDT), clarity ($C_{80}$ / $C_{50}$), definition ($D_{50}$), and center time ($T_s$), have been standardized in ISO 3382-1:2009 [12]. $T_{60}$ is the most essential RAP for representing the RACs of an enclosure. RAPs can be directly derived from the measured RIR. However, RIR measurement requires high-energy sound, so the people in an enclosure must be excluded, which is impractical for public spaces [28]. Furthermore, RIR measurement is constrained with respect to capturing the dynamics of the local RACs, which vary according to the locations, arrangements, and quantities of the objects and occupants that are present. The RAPs measured using specific standards may differ from noncompliant measurements employed within the same enclosure. Therefore, a blind RAP estimation method is imperative, particularly in public spaces where people cannot be excluded.
Several RPPs have been studied [27, 5, 6, 29, 30], such as the room volume, the sound source distance, and direction-of-arrival (DOA) of the sound source. The room volume is closely related to the RACs [27, 5, 6]. It may be derived from the measured RIR, but this process encounters the aforementioned issues. Moreover, the sound source distance and DOA are observer-dependent parameters. As a result, blind estimation methods have been proposed to obtain the RAPs and RPPs from observed signals. Blind estimation is a challenging task since it is an ill-posed problem that derives a system solely from an output without prior knowledge of the input.

Deep learning techniques are well-suited for constructing complex mappings between high-dimensional data acquired from messy realistic environments, often without explicit indications of relevance [31]. Hence, the common approach to blind estimation is to establish a mapping from the observed signals to the output using deep learning techniques.

In terms of blind RAP estimation in scenarios with background noise, deep learning techniques are currently at the forefront of this field. Several methods utilizing fully convolutional neural networks (CNNs) have achieved blind reverberation time and room volume estimation from the Gammatonegrams of single-channel noisy speech signals [32, 33] by leveraging the network architecture initially developed by Gamper and Tashev [34]. Furthermore, López et al. [35] and Callens et al. [36] introduced the convolutional recurrent neural network (CRNN) architecture, which performed best in the ACE challenge [37], for universally estimating the reverberation time, clarity, and direct-to-reverberant ratio from the mel-frequency cepstral coefficients (MFCCs) of single-channel noisy speech. Zheng et al. [38] proposed a CNN method with a gating mechanism that was designed for reverberation time estimation in noisy conditions using the spectrogram of the observed signal. Duangpummet et al. [39] developed a TAE-CNN architecture using the temporal amplitude envelope (TAE) of the observed signal, enabling the concurrent estimation of the STI, reverberation time, clarity, definition, and center time. For the blind estimation of RPPs, a fully convolutional architecture was employed to estimate room volumes from single-channel speech signals in [32, 33]. Additionally, the CRNN architecture was deployed to estimate sound source distances and the DOAs of sound sources from multi-channel speech signals [29].

The current methods, however, have several limitations. First, they either fall short of covering a sufficiently broad RIR range to accommodate real-world scenarios to the greatest extent possible or rely on widely used image-source-based synthetic RIRs. It is difficult for these synthetic RIRs to accurately reflect complex real-world room geometries, and they do not emulate the natural decay properties of realistic RIRs, which affects human perception [40]. Given that RAPs are employed to objectively assess human perceptions of RACs, the use of synthetic RIRs may introduce biases in perception evaluation [41]. Second, a universal architecture that can be trained efficiently with limited data to estimate RAPs and RPPs simultaneously in a unified methodology, especially instantaneous occupancy levels, is still lacking. Furthermore, to the best of our knowledge, no learning-based schemes are available for the blind estimation of the sound source distances and DOAs of sound sources from single-channel speech signals. Since it has been reported that single-channel acoustic cues can be used to estimate the DOAs [42, 43], it should be possible to estimate sound source distances from the same cues by using deep learning techniques.

These gaps motivate us to propose a new method, a blind estimator of room acoustic and physical parameters (BERP), that can blindly estimate room parameters universally in various real-world acoustic environments with background environmental noise. We integrate a sparse stochastic impulse response (SSIR) model, a new stochastic RIR model, into the process of mapping the observed speech signals to the desired RAPs. This RIR model fuses the distinct statistical properties, i.e., the sparse and dense statistical properties, of different segments of realistic RIRs to model them more accurately. The SSIR model helps derive all RAPs simultaneously without introducing additional complexity to the trainable model, since only the mapping between the observed signals and the SSIR parameters needs to be established. In contrast, we directly establish the relationship between the RPPs and the observed signals by using neural networks.

Our work makes three important contributions to the current knowledge frontier, as follows:

  • A new stochastic RIR model is proposed to effectively model the realistic RIRs in terms of simultaneous RAP derivations.

  • Signal models for the speech signals observed at a listener’s location, especially for occupancy level estimation, and the corresponding data synthesis pipelines are proposed.

  • A new universal blind estimation framework for blindly estimating RAPs and RPPs in parallel is proposed, which achieves state-of-the-art (SOTA) performance.

The rest of the paper is organized as follows: Section II briefly introduces the RAPs and RPPs. The proposed method is introduced in Section III, and the corresponding experimental settings are outlined in Section IV. We discuss and conclude in Section V.

II Room Parameters

II-A Room Acoustic Parameters

Several RAPs that describe the RACs of an auditory space have been investigated and standardized [44, 27, 45, 46, 41, 10, 12]. The parameters that audio engineers widely use to parameterize room acoustics are briefly introduced below.

II-A1 Intelligibility Parameters

Intelligibility parameters, including the STI and $\%\mathrm{AL}_{\mathrm{cons}}$, are used to predict speech intelligibility and assess verbal comprehension in a sound field.

Speech transmission index. The STI is employed to predict speech intelligibility and the corresponding listening difficulty in noisy surroundings. Houtgast and Steeneken initially defined the STI based on the modulation transfer function (MTF) [47, 48]. The higher the STI is, the more intelligible a sound field is. The STI can be calculated from the RIR as follows, as standardized in IEC 60268-16:2020 [10].

First, the RIR $h(t)$ is passed through seven octave-band filters, and the MTF of each band is obtained at 14 specific modulation frequencies from the corresponding temporal envelope as:

$$m_k(f_{m,i}) = \frac{\int_{0}^{\infty} h_{k,\mathrm{oct}}^{2}(t)\exp(-j2\pi f_{m,i}t)\,dt}{\int_{0}^{\infty} h_{k,\mathrm{oct}}^{2}(t)\,dt}, \tag{1}$$

where $k = 1, \ldots, 7$ indexes the octave bands, $i = 1, \ldots, 14$ indexes the modulation frequencies, $h_{k,\mathrm{oct}}$ denotes the RIR filtered in the $k$-th octave band, and $m_k(f_{m,i})$ represents the MTF for the $k$-th octave band at the $i$-th modulation frequency. The 14 modulation frequencies are $f_{m,i}\,(\mathrm{Hz}) = \{0.63,\ 0.80,\ 1.00,\ 1.25,\ 1.60,\ 2.00,\ 2.50,\ 3.15,\ 4.00,\ 5.00,\ 6.30,\ 8.00,\ 10.00,\ 12.50\}$. Then, the modulation distortion ratio is calculated as follows:

$$N_{k,i} = 10\log_{10}\left[\frac{m_k(f_{m,i})}{1 - m_k(f_{m,i})}\right]. \tag{2}$$

The transmission index for each octave band and modulation frequency is normalized to the unit scale by limiting $N_{k,i}$ to the range of $\pm 15$ dB:

$$T(k,i) = \begin{cases} 1, & N_{k,i} > 15, \\ \dfrac{N_{k,i} + 15}{30}, & -15 \leq N_{k,i} \leq 15, \\ 0, & N_{k,i} < -15. \end{cases} \tag{3}$$

Finally, the STI is calculated as the weighted sum of $T(k,i)$:

$$\mathrm{STI} = \sum_{k=1}^{7} w_k \left[\frac{1}{14}\sum_{i=1}^{14} T(k,i)\right], \tag{4}$$

where $w = \{0.129,\ 0.143,\ 0.114,\ 0.114,\ 0.186,\ 0.171,\ 0.143\}$.
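The four steps above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: the function name `sti_from_band_rirs` is hypothetical, the octave-band filtering is assumed to have been done beforehand (the input holds one band-filtered RIR per row), the magnitude of the normalized transform in Eq. (1) is taken so that the logarithm in Eq. (2) is real-valued, and the MTF is clipped slightly away from 0 and 1 to avoid division by zero.

```python
import numpy as np

# 14 modulation frequencies (Hz) and 7 octave-band weights quoted in the text.
MOD_FREQS = np.array([0.63, 0.80, 1.00, 1.25, 1.60, 2.00, 2.50,
                      3.15, 4.00, 5.00, 6.30, 8.00, 10.00, 12.50])
BAND_WEIGHTS = np.array([0.129, 0.143, 0.114, 0.114, 0.186, 0.171, 0.143])


def sti_from_band_rirs(band_rirs, fs):
    """STI from 7 octave-band-filtered RIRs, shape (7, n_samples).

    Assumption: the band filtering itself is done by the caller.
    """
    band_rirs = np.asarray(band_rirs, dtype=float)
    t = np.arange(band_rirs.shape[1]) / fs
    energy = band_rirs ** 2                                   # (7, n)

    # Eq. (1): normalized Fourier transform of the squared RIR at each f_m.
    # The sample spacing dt cancels between numerator and denominator.
    kernel = np.exp(-2j * np.pi * MOD_FREQS[:, None] * t[None, :])  # (14, n)
    mtf = np.abs(energy @ kernel.T) / energy.sum(axis=1, keepdims=True)
    mtf = np.clip(mtf, 1e-12, 1.0 - 1e-12)                    # numerical guard

    n_db = 10.0 * np.log10(mtf / (1.0 - mtf))                 # Eq. (2)
    ti = (np.clip(n_db, -15.0, 15.0) + 15.0) / 30.0           # Eq. (3)
    return float(BAND_WEIGHTS @ ti.mean(axis=1))              # Eq. (4)
```

For an ideal, reverberation-free impulse in every band, the MTF equals one at all modulation frequencies and the sketch returns an STI of 1; any decaying tail lowers the value.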

Percentage articulation loss of consonants. $\%\mathrm{AL}_{\mathrm{cons}}$ measures the proportion of incorrectly understood consonants and was originally introduced by Peutz and Klein [11]. Since the STI does not account for the way in which a listener’s proficiency and linguistic knowledge affect intelligibility, $\%\mathrm{AL}_{\mathrm{cons}}$ assumes that consonants are harder to comprehend than vowels in a room. $\%\mathrm{AL}_{\mathrm{cons}}$ thus addresses the limitations of the STI by retaining significant intelligibility-related information and considering linguistic proficiency. Moreover, its robustness against contamination from guessing makes it a good indicator of speech intelligibility [6]. Thus, $\%\mathrm{AL}_{\mathrm{cons}}$ emerges as an indispensable complement to the STI for comprehensively evaluating speech intelligibility within various room settings. $\%\mathrm{AL}_{\mathrm{cons}}$ can be calculated directly from the STI according to Farrell Becker’s empirical formula [49]:

$$\%\mathrm{AL}_{\mathrm{cons}} = 170.5045 \cdot e^{-5.419 \cdot \mathrm{STI}}. \tag{5}$$
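As a small illustration of Eq. (5), the conversion can be written as a one-line function. The function name is hypothetical and no clipping is applied, so the sketch simply reproduces the empirical formula as stated.

```python
import math


def al_cons_from_sti(sti):
    """Percentage articulation loss of consonants from the STI, Eq. (5)."""
    return 170.5045 * math.exp(-5.419 * sti)
```

The value decreases monotonically as the STI increases, which matches the interpretation that a more intelligible sound field loses fewer consonants.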

II-A2 Reverberation Parameters

The reverberation time ($T_{60}$) and early decay time (EDT) relate to reverberation and quantify the subjective impression of the vivacity of a sound field. $T_{60}$ is the most essential RAP since it characterizes the physical properties of the RACs over which the reverberant energy decays by $60$ dB. The EDT represents the decay time based on the initial $10$ dB of decay, emphasizing the greater contribution of early reflections to the perceived reverberation. $T_{60}$ is the 60-dB decay time calculated by fitting a line to the portion of the energy decay curve (EDC) of the RIR from $-5$ dB to $-35$ dB and linearly extrapolating it to $-60$ dB. Similarly, the EDT is obtained by fitting a line to the initial $10$ dB of decay of the EDC (from $0$ dB to $-10$ dB) and extrapolating it to $-60$ dB.
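The line-fitting procedure described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' code: the EDC is obtained by Schroeder backward integration of the squared RIR, the function names are hypothetical, and the RIR is assumed to be noise-free and long enough that the EDC actually reaches the lower fitting limit.

```python
import numpy as np


def energy_decay_curve_db(h):
    """Schroeder backward integration of the squared RIR, normalized to 0 dB."""
    energy = np.cumsum(np.asarray(h, dtype=float)[::-1] ** 2)[::-1]
    energy = np.maximum(energy, np.finfo(float).tiny)   # avoid log10(0)
    return 10.0 * np.log10(energy / energy[0])


def decay_time(h, fs, upper_db, lower_db):
    """Fit a line to the EDC between two levels and extrapolate to -60 dB."""
    edc = energy_decay_curve_db(h)
    t = np.arange(len(edc)) / fs
    mask = (edc <= upper_db) & (edc >= lower_db)
    slope, _ = np.polyfit(t[mask], edc[mask], 1)          # slope in dB/s
    return -60.0 / slope


def t60(h, fs):
    return decay_time(h, fs, -5.0, -35.0)    # fit from -5 dB to -35 dB


def edt(h, fs):
    return decay_time(h, fs, 0.0, -10.0)     # fit over the initial 10 dB
```

For an ideal exponential decay, both fits have the same slope, so $T_{60}$ and the EDT coincide; in measured RIRs they differ because the early part of the decay is usually not log-linear.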

II-A3 Energy Parameters

Clarity ($C_{50}$ and $C_{80}$), definition ($D_{50}$), and center time ($T_s$) are energy parameters that quantify the ratio between the RIR energy contributed by early reflections and that contributed by late reverberation. They are strongly related to the impression of transparency.

Clarity. $C_{50}$ and $C_{80}$ express the logarithmic ratio of the energy within the first $50$ ms (for speech) and the first $80$ ms (for music), respectively, to the energy of the remaining RIR, thereby characterizing the perceived clarity of a speech or music signal propagating in a room. Clarity is defined as:

$$C_{t_e} = 10\log_{10}\left(\frac{\int_{0}^{t_e} h^{2}(t)\,dt}{\int_{t_e}^{\infty} h^{2}(t)\,dt}\right), \tag{6}$$

where $t_e$ denotes $50$ ms or $80$ ms.

Definition. $D_{50}$ indicates the subjective intelligibility of speech in a room and is defined as the ratio of the energy received within the first $50$ ms to the total energy of the RIR, expressed as a percentage:

$$D_{50} = \frac{\int_{0}^{50\ \mathrm{ms}} h^{2}(t)\,dt}{\int_{0}^{\infty} h^{2}(t)\,dt} \times 100. \tag{7}$$

Center time. $T_s$ refers to “the center of gravity time”, characterizing the balance between clarity and reverberation that is related to speech intelligibility. $T_s$ is given by:

$$T_s = \frac{\int_{0}^{\infty} t\,h^{2}(t)\,dt}{\int_{0}^{\infty} h^{2}(t)\,dt}. \tag{8}$$
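Eqs. (6)-(8) translate directly into discrete sums over the squared RIR. The sketch below is illustrative only: the function name and the returned dictionary keys are hypothetical, and the RIR is assumed to start at the arrival of the direct sound (so that $t = 0$ corresponds to the first sample) and to contain energy after $80$ ms.

```python
import numpy as np


def energy_parameters(h, fs):
    """C50, C80 (dB), D50 (%), and Ts (s) from an RIR sampled at fs."""
    e = np.asarray(h, dtype=float) ** 2
    t = np.arange(len(e)) / fs
    total = e.sum()

    def clarity(te_s):
        k = int(round(te_s * fs))            # first sample after t_e
        return 10.0 * np.log10(e[:k].sum() / e[k:].sum())     # Eq. (6)

    k50 = int(round(0.050 * fs))
    return {
        "C50": clarity(0.050),
        "C80": clarity(0.080),
        "D50": 100.0 * e[:k50].sum() / total,                 # Eq. (7)
        "Ts": float((t * e).sum() / total),                   # Eq. (8)
    }
```

Because $C_{50}$ and $D_{50}$ share the same early-energy integral, they are related by $C_{50} = 10\log_{10}\big(D_{50}/(100 - D_{50})\big)$, which is a convenient consistency check on any implementation.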
Figure 1: Overview of the architecture of the blind estimator of room acoustic and physical parameters (BERP). The input consists of the speech signals observed within a room, i.e., the noisy reverberant and crowded reverberant speech signals, while the output contains the estimated RAPs and RPPs detailed in Sections II-A and II-B, respectively. The architecture adapts to inputs of various lengths without the need for length alignment. Figs. 7-9 show the detailed architectures of the room feature encoder, Fig. 10 shows the parametric predictor, and Fig. 12 shows the architecture of the acoustical bias corrector.

II-B Room Physical Parameters

RPPs are parameters related to the physical characteristics of a room. These parameters encompass the geometric room volume, sound source distance, DOA of the sound source, and instantaneous occupancy level around the listener’s location.

II-B1 Geometric Room Volume

The geometric room volume $V$ is a position-independent parameter for modeling the attributes of a room. $V$ is strongly related to the estimation of the critical distance ($D_c$), which is the distance from the sound source at which the energy density of the reverberant signal is equal to that of the direct signal [27]. $D_c$ can be approximated using Sabine’s formula:

$$D_c = \sqrt{\frac{\varrho A}{16\pi}} \approx 0.1\sqrt{\frac{\varrho V}{\pi T_{60}}}, \tag{9}$$

where $\varrho$ signifies the source directivity factor, and $A$ represents the equivalent absorption area of a room. $D_c$ is vital for determining whether a virtual sound source should be rendered with reverberation, thereby serving as a key distance cue for the perception of reverberation by the listener [27, 33].
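The approximation in Eq. (9) follows from Sabine’s reverberation formula. As a short worked step, assuming SI units and the usual Sabine constant of $0.161$ s/m (a value that is not stated in the text above), the equivalent absorption area can be eliminated as follows:

$$T_{60} \approx \frac{0.161\,V}{A} \;\Rightarrow\; A \approx \frac{0.161\,V}{T_{60}}, \qquad D_c = \sqrt{\frac{\varrho A}{16\pi}} \approx \sqrt{\frac{0.161}{16}}\,\sqrt{\frac{\varrho V}{\pi T_{60}}} \approx 0.1\sqrt{\frac{\varrho V}{\pi T_{60}}}.$$

Since $\sqrt{0.161/16} \approx 0.1003$, the prefactor rounds to $0.1$, which is why $D_c$ can be obtained from $V$ and $T_{60}$ alone once the directivity factor is known.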

Furthermore, the mixing time used in AR rendering applications [5] can be determined from $V$ as $t_m = \sqrt{V}$. Jot et al. [5] identified room volume as the reverberation fingerprint to characterize rooms for spatial AR rendering. $V$ also plays an important role in speech intelligibility [6]. The critical distance of intelligibility ($D_{ci}$), which acts as a distance cue for perceived intelligibility, is derived from $V$ as:

$$D_{ci} = 0.2\sqrt{\frac{\varrho V}{T_{60}}}. \tag{10}$$

$\%\mathrm{AL}_{\mathrm{cons}}$ also exhibits a strong relationship with $V$, and it can be alternatively expressed as [6]:

$$\%\mathrm{AL}_{\mathrm{cons}} = \frac{200 D^{2} T_{60}^{2}}{\varrho V} + \mathfrak{c}, \tag{11}$$

where $D$ is the sound source distance, and $\mathfrak{c}$ is the correction factor.
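To make the role of $V$ in Eqs. (9)-(11) concrete, the three relations can be evaluated with a few helper functions. This is only a sketch: the function names are hypothetical, SI units (metres, cubic metres, seconds) are assumed, an omnidirectional source ($\varrho = 1$) is used as the default, and the correction factor $\mathfrak{c}$ defaults to zero purely for illustration because its value is not given in the text.

```python
import math


def critical_distance(volume, t60, directivity=1.0):
    """Approximate critical distance D_c, right-hand side of Eq. (9)."""
    return 0.1 * math.sqrt(directivity * volume / (math.pi * t60))


def critical_distance_intelligibility(volume, t60, directivity=1.0):
    """Critical distance of intelligibility D_ci, Eq. (10)."""
    return 0.2 * math.sqrt(directivity * volume / t60)


def al_cons_from_geometry(distance, volume, t60, directivity=1.0,
                          correction=0.0):
    """%ALcons from source distance, T60, and room volume, Eq. (11).

    `correction` stands for the correction factor; 0.0 is an assumed
    placeholder, not a value from the paper.
    """
    return (200.0 * distance ** 2 * t60 ** 2 / (directivity * volume)
            + correction)
```

For example, with $V = 100$, $T_{60} = 1$, and $\varrho = 1$, the sketch gives $D_{ci} = 2$ and, for a source at $D = 2$ with zero correction, a $\%\mathrm{AL}_{\mathrm{cons}}$ of 8.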

II-B2 Sound Source Distance

The sound source distance $D$ contributes significantly to complementing sound source localization (SSL) by integrating it with the DOA of the sound source [29]. SSL is widely used in applications such as sound source separation [3], audio-oriented and navigational systems [8], speech-related applications [30], and human-robot interaction [26]. Furthermore, $D$ is intimately related to the perception of speech intelligibility, particularly in terms of $\%\mathrm{AL}_{\mathrm{cons}}$, as elaborated in Eq. (11).

II-B3 Direction-of-Arrival of the Sound Source

As mentioned in Section II-B2, the DOA is a crucial component of SSL [30], and it has several applications in sound source separation [4], speech recognition [20], speech enhancement [9], and room acoustical analysis [7]. In this work, the DOA is represented by a pair consisting of an azimuth ($\theta$) and an elevation ($\psi$) and is denoted as $\mathrm{DOA} \coloneqq \{\theta, \psi\}$.

II-B4 Instantaneous Occupancy Level

The detection of the instantaneous occupancy level $N$ of a room around a listener’s location is highly useful for several applications. It is commonly known that the number of occupants affects the reverberation [50], thus affecting the efficacy of demand-driven hearing aid systems and speech enhancement methods. Additionally, the interfering speech generated by the occupants around the listener affects the target signals that the listener intends to receive. Knowing the occupancy level helps control interference to achieve intelligible and clear transmission.

In the context of smart homes, the estimated number of occupants can optimize the control of demand-driven heating, ventilation, and air conditioning (HVAC) operations in the local space to significantly reduce the cost of building operations for sustainable smart buildings [51, 52]. In XR and AR scenarios, the local occupancy level, as an important piece of environmental information, is fundamental for ensuring safe interaction in real-world scenes, especially in public spaces populated by others.

III Proposed Method

Overview: BERP. Fig. 1 shows the signal flow within the proposed BERP framework. The input waveform is converted into a spectrogram-variant feature representation, which is subsequently fed into the room feature encoder (RFE). Finally, parametric predictors (PPs) and a fully connected (FC) layer output the room parameters according to the estimation task and the real-world scenario. When estimating the RAPs and the RPPs other than the occupancy level, noisy reverberant speech signals serve as the observed inputs of the featurizer. In contrast, when estimating the instantaneous occupancy level $N$, crowded reverberant speech signals are the inputs of the featurizer.

III-A Signal Models

III-A1 Noisy Reverberant Signal Model

The observed noisy reverberant signal, as perceived by a listener while transmitting from a speaker within a room and subject to the influence of the background environmental noise, can be formulated as:

ynr(t)=x(t)h(t)+n(t)subscript𝑦nr𝑡𝑥𝑡𝑡𝑛𝑡y_{\rm{nr}}(t)=x(t)*h(t)+n(t)italic_y start_POSTSUBSCRIPT roman_nr end_POSTSUBSCRIPT ( italic_t ) = italic_x ( italic_t ) ∗ italic_h ( italic_t ) + italic_n ( italic_t ) (12)

where $y_{\rm nr}(t)$ denotes the noisy reverberant signal as perceived by the listener, $x(t)$ denotes the sound source speech signal, $h(t)$ denotes the RIR, and $n(t)$ represents the background noise that is prevalent in the listener's local surroundings. The symbol "$*$" denotes the convolution operation.

The signal $y_{\rm nr}$ encapsulates the RIR information that fully characterizes the RACs in the listener's local space, including RAPs and room volume. In addition, it contains information pertaining to sound source localization, such as the distance, azimuth, and elevation. Therefore, this signal model is instrumental for parameterizing the listener's local acoustic space, encompassing aspects such as the volume, the distance and DOAs of the sound source, and RAPs. The noisy reverberant signal model is employed to model the real-world scenarios in which a listener interacts with a single speaker in the presence of environmental noise, as illustrated in Fig. 2.

Figure 2: Illustration of noisy and crowded reverberant signals $y_{\rm nr}$ and $y_{\rm cr}$, respectively, in real-world scenarios.

III-A2 Crowded Reverberant Signal Model

Currently, the research domain lacks a comprehensive reverberant speech corpus for crowded environments that can enable the estimation of the occupancy level around the listener and that includes complete meta-information: the number of speakers, the spatial geometry of the speaker distribution relative to the listener, and the local RACs of the space that the listener occupies. To address this gap, we introduce a novel signal model that incorporates this detailed meta-information, as shown in Fig. 2.

This signal model can be expressed as:

$$y_{\rm cr}(t) = \sum_{i=1}^{N}\Big[\frac{d_0}{d_i}A_0 x_i(t) * h(t)\Big] = \Big[\sum_{i=1}^{N}\frac{d_0}{d_i}A_0 x_i(t)\Big] * h(t), \tag{13}$$

where $y_{\rm cr}$ signifies the crowded reverberant speech signal, $x_i(t)$ represents the speech signal originating from the $i$-th speaker proximal to the listener, and $d_i$ denotes the distance between the $i$-th speaker and the listener, which adheres to a Gaussian distribution. $A_0$ represents the baseline amplitude observed at a distance of $d_0$ from the listener, and $h(t)$ denotes the RIR that delineates the acoustic characteristics of the local room. Here, $d_0$ is assumed to be equal to 1. $N$ represents the total count of speakers, i.e., the occupancy level, which follows a gamma distribution; this distribution is well-suited for modeling real-life events that yield only positive results.

When developing this speech signal model, a set of fundamental assumptions is postulated. These assumptions are instrumental for enabling an approximation that closely mirrors real-world scenarios while effectively mitigating the intricacies embedded within the observed speech signal, thereby devising a theoretically sound and practically viable model.

Assumption 1.

We hypothesize that the maximum spatial extent surrounding the listener is approximately 6 meters. This is grounded in the fact that speech signals originating from the occupants near the listener undergo an attenuation of approximately 35 dB, rendering them nearly imperceptible as distinguishable speech and essentially inaudible [53, 54]. Thus, crowded speech signals radiating beyond this 6-meter threshold are considered background environmental noise.

Assumption 2.

The model assumes that the upper limit imposed on the number of speakers near the listener, i.e., $N$, is restricted to 12. This premise is substantiated by empirical findings, which suggest that excessively overlapping concurrent speech signals tend to amalgamate into singular background noise, consequently diminishing their individual discernibility as separate speech elements.

Assumption 3.

In everyday settings, particularly within a confined small area such as one with a radius of 6 meters, it is more common for a listener to engage with approximately 3 to 4 occupants. Hence, it is postulated that within a zone bounded by a 6-meter radius, the listener predominantly encounters an average of 4 occupant speakers.

Figure 3: Illustration of the image-source principle.

III-A3 Sparse Stochastic Impulse Response Model

Within the scope of dynamic blind RAPs and RPPs estimation, our access is confined to an observed noisy reverberant signal. Hence, we model the observed signal as in Eq. (12). The ill-posed nature of blind estimation necessitates an RIR model to approximate an unknown RIR for serving as a bridge between the sound source signal and the perceived noisy reverberant signal.

Moreover, to reduce the computational complexity and to facilitate the simultaneous estimation of the RAPs, it is more efficient to model the RIR and subsequently estimate the parameters of this RIR model. This approach enables the simultaneous derivation of the RAPs from the modeled RIR instead of directly estimating them from the noisy reverberant signal.

By applying modal theory to the room frequency response [55], i.e., the Fourier transform of the RIR, and using Schroeder's frequency [56], the RIR can be divided into isolated room modes (early reflections) and dense room modes (late reverberation).

Badeau [55] introduced a unified mathematical framework for stochastically modeling the RIR, according to the image-source principle [27], as shown in Fig. 3. This work reported that the image sources (i.e., reflections) are distributed according to a uniform Poisson distribution. The author further demonstrated that this stochastic distribution of image sources remains invariant regardless of the sound source's and the receiver's locations. Alternatively, based on billiard theory, Polack [57] showed that the Poisson distribution of image sources is also independent of the room geometry. Additionally, Traer and McDermott [40] analyzed the RIR statistics. They found that, during the time interval of dense room modes, the RIR exhibits a Gaussian distribution; this was in stark contrast to the time interval of isolated room modes, which exhibited a non-Gaussian distribution.

Figure 4: Fitting of the SSIR model to the realistic RIR. (a) The temporal envelope of the realistic and synthetic RIRs. (b) The corresponding realistic RIR.

Drawing inspiration from [57, 40, 55], we present a novel stochastic RIR model, namely, the sparse stochastic impulse response (SSIR) model. This model combines the different stochastic properties of the isolated and dense room modes of the RIR. Specifically, the time interval of the isolated room modes is dominated by uniform Poisson-distributed image sources whose sparsity is proportional to the room volume, $h_{\rm i}(t) \sim \boldsymbol{P}(\lambda\lvert V\rvert)$. Conversely, the time interval of the dense room modes follows a Gaussian distribution, $h_{\rm d}(t) \sim \boldsymbol{N}(0, 1)$. Here, $h_{\rm i}(t)$ and $h_{\rm d}(t)$ represent the isolated and dense room modes of the RIR, respectively, and $V$ denotes the room volume. Fig. 4 shows the fitting of the proposed SSIR model to a realistic RIR.

The SSIR model can be defined as:

$$h_{\rm ssir}(t)=\begin{cases}h_{\rm i}(t)=b\,e^{\alpha t/T_i}\odot\boldsymbol{P}(\lambda\lvert V\rvert), & t\in[0,T_i)\\ h_{\rm d}(t)=b\,e^{-\alpha t/T_d}\odot\boldsymbol{N}(0,1), & t\in[T_i,T_d]\end{cases} \tag{14}$$

$$\boldsymbol{P}(\lambda\lvert V\rvert)=\frac{\lambda^{V}\cdot e^{-\lambda}}{V!}, \tag{15}$$

$$\boldsymbol{N}(0,1)=\frac{e^{-t^{2}/2}}{\sqrt{2\pi}}, \tag{16}$$

where $T_i$ and $T_d$ are two parameters that control the exponentially ascending and descending temporal envelopes of the RIR, respectively. The constant $\alpha = 6.9$ is known as Schroeder's coefficient [58], and $\lambda$, which is equal to $\mu$, signifies the average of $T_i$ across the sample set. Here, $\mu$ is empirically determined to be 0.0399.
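
As an illustration of how Eq. (14) could be turned into a synthetic RIR, the following is a minimal NumPy sketch and not the authors' implementation. It rests on several assumptions that are not stated in the text: the Poisson term is drawn independently per sample with rate `lam` (the volume dependence is ignored), the amplitude $b$ defaults to 1, the sampling rate is 16 kHz, and the output is peak-normalized. The function name `synthesize_ssir` is hypothetical.

```python
import numpy as np

ALPHA = 6.9      # Schroeder's coefficient, as given in the text
LAMBDA = 0.0399  # empirical value of mu reported in the text


def synthesize_ssir(t_i, t_d, fs=16000, b=1.0, lam=LAMBDA, seed=0):
    """Draw one realization of the two branches of Eq. (14).

    Assumptions (not from the paper): per-sample Poisson draws with
    rate `lam`, b = 1 by default, and peak normalization at the end.
    """
    rng = np.random.default_rng(seed)
    n_total = int(round(t_d * fs)) + 1   # samples covering [0, T_d]
    n_early = int(round(t_i * fs))       # samples covering [0, T_i)
    t = np.arange(n_total) / fs

    h = np.zeros(n_total)
    # Isolated room modes: rising envelope times sparse Poisson counts.
    early = rng.poisson(lam, size=n_early).astype(float)
    h[:n_early] = b * np.exp(ALPHA * t[:n_early] / t_i) * early
    # Dense room modes: decaying envelope times standard Gaussian noise.
    late = rng.standard_normal(n_total - n_early)
    h[n_early:] = b * np.exp(-ALPHA * t[n_early:] / t_d) * late

    peak = np.max(np.abs(h))
    return h / peak if peak > 0 else h
```

Calling `synthesize_ssir(0.05, 0.5)` yields a 0.5 s response whose first 50 ms contain only a few isolated impulses, followed by a dense, exponentially decaying Gaussian tail. Because the two envelopes are implemented literally as printed in Eq. (14), their relative scale at $t = T_i$ is not matched in this sketch.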

III-B Datasets

A significant challenge encountered when using a data-driven method for the task of blind room acoustical estimation lies in the quality and coverage of the collected data, which are crucial for ensuring satisfactory generalization. Therefore, it is essential to construct a dataset characterized by large-scale quantity, substantial diversity, and detailed annotations of RAPs and RPPs. We collect realistic RIRs covering an extensive range of rooms with different volumes and geometries, distinct sound source and receiver locations, and different sound absorption coefficients of the room surfaces. Hence, the dataset contains a wide spectrum of broadband RAPs and RPPs. Furthermore, the dataset is augmented to refine the distribution of the annotations, thereby maximizing the diversity and representativeness of the dataset.

III-B1 RIR Data Collection

We aggregated five extensive realistic RIR datasets to construct a composite RIR dataset for representing a wide range of room geometries and RACs. These datasets are the Arni RIR dataset [59], the Motus dataset [60], the BUT ReverbDB [61], the ACE corpus of RIRs [37], and the OpenAIR dataset [62]. Each dataset comprises single-channel, omnidirectionally recorded RIRs. We resampled all RIRs to 16 kHz.

III-B2 Speech and Noise Data Collection

To replicate the background environmental noise encountered in real-world scenarios, instead of using synthetic white Gaussian noise, we employ actual noise samples from real-world daily life circumstances. We integrate noise signals from the DEMAND [63] and BUT [61] noise datasets, both of which are collected in real-world daily life environments and resampled to 16 kHz.

We use the LibriSpeech corpus [64] for sampling the sound source speech signals when synthesizing the observed reverberant signals. Specifically, we select a 360-hour clean subset. This subset is composed of more than 100,000 unique clips articulated by 921 speakers with completely distinct linguistic contents. The use of this dataset ensures a broad spectrum of diverse speech signals, enhancing the robustness and generalizability of our synthesized signals.

Figure 5: Histogram of occupancy levels across crowded reverberant speech signals, adhering to a gamma distribution.

III-C Data Preparation

III-C1 Synthesis of Noisy Reverberant Speech Signals

The composite RIR dataset carries detailed annotations of $T_i$ and $T_d$ (the parameters of the SSIR model), the room volume $V$, the sound source distance $D$, and the DOA $\{\theta, \psi\}$ of the sound source. On this dataset, we further employ a data augmentation strategy. The strategy involves data upsampling and downsampling techniques to modulate the distribution of the labels, which mitigates potential biases in the data distributions and yields more natural distributions. The degrees of upsampling and downsampling are calibrated based on the relative rarity of the values of each label. After the data augmentation process is applied to the RIR dataset, a comprehensive collection of 47,430 realistic RIRs is compiled. This RIR dataset contains a wide range of RIRs, for which the corresponding $T_{60}$ spans from 0.18 to 8.00 s.

Then, we randomly sample 47,430 clips from the LibriSpeech corpus, choosing clips with the most common length (from 12 to 17 s) to be the sound source speech signals, regardless of the speaker information and linguistic content that they contain. In parallel, noise signals are randomly sampled, following an independent and identically distributed (I.I.D.) pattern, from the DEMAND and BUT datasets. Then, in accordance with Eq. (12), we synthesize the noisy reverberant speech signals. To enhance the robustness and efficacy of the model across diverse noise conditions, the signal-to-noise ratio (SNR) between the reverberant and noise signals is varied uniformly over five levels, from 0 dB to 20 dB in 5 dB increments, plus a noise-free condition (Inf). Given the uniqueness of each clip, we guarantee that every synthesized speech signal maintains its individuality in terms of both its waveform and linguistic content, further augmenting the diversity and richness of the synthesized dataset.
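
The mixing step of Eq. (12) at a prescribed SNR can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the released pipeline: the helper name `mix_at_snr` is hypothetical, the reverberant signal is truncated to the length of the dry speech, and the noise is tiled or trimmed to fit. For full-length clips and long RIRs, an FFT-based convolution would be preferable to `np.convolve`.

```python
import numpy as np


def mix_at_snr(speech, rir, noise, snr_db):
    """Return x * h + n with the noise scaled to the requested SNR in dB.

    Passing snr_db=np.inf gives the noise-free (Inf) condition.
    Truncation to len(speech) is an assumption of this sketch.
    """
    reverberant = np.convolve(speech, rir)[: len(speech)]
    if np.isinf(snr_db):
        return reverberant
    # Tile or trim the noise so that it covers the reverberant signal.
    reps = int(np.ceil(len(reverberant) / len(noise)))
    noise = np.tile(noise, reps)[: len(reverberant)]
    p_sig = np.mean(reverberant ** 2)
    p_noise = np.mean(noise ** 2)
    # Choose the gain so that p_sig / (gain^2 * p_noise) = 10^(snr_db / 10).
    gain = np.sqrt(p_sig / (p_noise * 10.0 ** (snr_db / 10.0)))
    return reverberant + gain * noise
```

The six conditions described above would then correspond to `snr_db` values of 0, 5, 10, 15, 20, and `np.inf`.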

III-C2 Synthesis of Crowded Reverberant Speech Signals

Initially, we apply voice activity detection to the LibriSpeech corpus to segment and annotate the timestamps corresponding to speech and silence segments. This process underlies the annotations of the synthesized crowded reverberant signals. Then, the gamma distribution of the occupancy levels is modeled. Explicitly, in rooms of varying volumes, the occupancy level $N$ follows a Gaussian distribution $N \sim \boldsymbol{N}(N, 1)$, consistent with the real-world principle that larger spaces typically accommodate more occupants, while smaller spaces accommodate fewer occupants.

The Gaussian mixture distribution is used to approximate the gamma distribution of the occupancy level, as detailed in Eq. (13). Furthermore, the distance $d_i$ from the $i$-th occupant speaker to the listener follows a Gaussian distribution $d_i \sim \boldsymbol{N}(\mu_d, 1)$, where $\mu_d$ represents the mean of the maximum and minimum distances. In accordance with Assumption 1 (Sec. III-A2), $\mu_d$ is set to 2.5.

Using Eq. (13), we synthesize the crowded reverberant speech signals by superimposing speech signals uniformly sampled from the LibriSpeech corpus, aligning with the annotated speech and silence segmentation. The start index of each overlapping speech signal is determined based on an I.I.D. pattern. Additionally, to authentically replicate the local room acoustics, the room volume annotations are matched precisely with their corresponding RIRs, thereby ensuring a realistic acoustic environment. Finally, we obtain a dataset comprising 47,430 samples of crowded reverberant speech signals, ranging from 10 to 25 s. Fig. 6 shows an example of a crowded reverberant speech signal and its corresponding occupancy level according to Eq. (13) and the aforementioned synthesis strategy.
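
The right-hand form of Eq. (13), in which the distance-weighted sources are summed first and convolved once with the RIR, can be sketched as below. This is an illustrative NumPy sketch only. The gamma shape and scale, the lower distance bound of 0.5 m, and $A_0 = 1$ are assumptions chosen to respect Assumptions 1 to 3 (mean of 4 occupants, at most 12, within 6 m); random start offsets and the VAD-based alignment described above are omitted. The names `sample_occupancy` and `crowded_signal` are hypothetical.

```python
import numpy as np

MAX_OCCUPANTS = 12   # Assumption 2
MAX_DISTANCE = 6.0   # Assumption 1, in meters
MEAN_DISTANCE = 2.5  # mu_d from the text
D0, A0 = 1.0, 1.0    # d_0 = 1 per the text; A_0 = 1 is an assumption


def sample_occupancy(rng, shape=4.0, scale=1.0):
    """Draw N from a gamma distribution with mean shape * scale = 4.

    The shape and scale are hypothetical; only the mean of 4
    (Assumption 3) and the cap of 12 come from the text.
    """
    n = int(round(rng.gamma(shape, scale)))
    return int(np.clip(n, 1, MAX_OCCUPANTS))


def crowded_signal(sources, rir, rng):
    """Eq. (13): sum the distance-weighted sources, then convolve with h."""
    length = max(len(x) for x in sources)
    dry = np.zeros(length)
    for x in sources:
        # d_i ~ N(mu_d, 1); clipping keeps the distance physically valid.
        d = float(np.clip(rng.normal(MEAN_DISTANCE, 1.0), 0.5, MAX_DISTANCE))
        dry[: len(x)] += (D0 / d) * A0 * x
    return np.convolve(dry, rir)[:length]
```

Convolving once after the summation is what the second equality in Eq. (13) permits, and it avoids $N$ separate convolutions per synthesized sample.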

Figure 6: An example of a crowded reverberant speech signal. (a) The crowded reverberant signal. (b) The corresponding instantaneous occupancy level (smoothed with a 500 ms frame).

III-D Estimation Framework Architecture

III-D1 Featurization

We use three types of featurization methods to represent the observed input signals: the Gammatonegram, the MFCC, and the mel spectrogram. The Gammatonegram emphasizes the low-frequency sections, which are important while a signal propagates within a room [65, 34]. The MFCC characterizes the shape of the spectral envelope of a reverberant signal, which is closely related to the MTF of room acoustics [58]. The mel spectrogram, in turn, mimics human subjective perception of the RACs.

III-D2 Room Feature Representation Learning

We use a room feature encoder (RFE) to learn room feature representations.

Room Feature Encoder. The RFE is structured into eight blocks, each comprising four components: a half-residual feedforward network, a multiheaded self attention module, a convolutional network, and another half-residual feedforward network [66].

This encoder integrates CNNs and transformer models, which capture local and global acoustic features, respectively, since Wang et al. [28] showed that the acoustical information is spread across all frequency components of the reverberant signal. Such integration makes it particularly well suited for learning the sophisticated mappings between the noisy and crowded reverberant speech signals with complex waveforms and the desired room parameters.

The signal flow from the input feature representation $\mathbcal{x}_i$ to the latent variable output $\mathbcal{y}_i$ across each block is mathematically expressed as:

$$\mathbcal{x}_i^{\ddagger}=\mathbcal{x}_i+\frac{1}{2}\cdot\textbf{FFN}(\mathbcal{x}_i), \tag{17}$$

$$\mathbcal{x}_i^{\ddagger\ddagger}=\mathbcal{x}_i^{\ddagger}+\textbf{LayerNorm}[\textbf{MHSA}(\mathbcal{x}_i^{\ddagger})], \tag{18}$$

$$\mathbcal{x}_i^{\ddagger\ddagger\ddagger}=\mathbcal{x}_i^{\ddagger\ddagger}+\textbf{Conv}(\mathbcal{x}_i^{\ddagger\ddagger}), \tag{19}$$

$$\mathbcal{y}_i=\textbf{LayerNorm}\Big[\mathbcal{x}_i^{\ddagger\ddagger\ddagger}+\frac{1}{2}\cdot\textbf{FFN}(\mathbcal{x}_i^{\ddagger\ddagger\ddagger})\Big], \tag{20}$$

where FFN denotes the feedforward network, MHSA denotes the multiheaded self attention, Conv denotes the convolutional network, and LayerNorm represents the layer normalization operation.
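
The residual flow of Eqs. (17) to (20) through one block can be sketched in a framework-agnostic way as follows. This NumPy sketch only illustrates the order of operations. The sub-modules are passed in as hypothetical callables rather than implemented, and `layer_norm` here has no learnable scale or shift, which is a simplification of this sketch.

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    """Normalize each frame (last axis) to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def rfe_block(x, ffn1, mhsa, conv, ffn2):
    """Apply Eqs. (17)-(20) to x of shape (frames, dim).

    ffn1, mhsa, conv, and ffn2 are stand-in callables that map
    (frames, dim) -> (frames, dim); they are not implemented here.
    """
    x1 = x + 0.5 * ffn1(x)                   # Eq. (17): half-residual FFN
    x2 = x1 + layer_norm(mhsa(x1))           # Eq. (18): self attention
    x3 = x2 + conv(x2)                       # Eq. (19): convolutional network
    return layer_norm(x3 + 0.5 * ffn2(x3))   # Eq. (20): half-residual FFN, norm
```

Stacking eight such blocks, each with its own sub-modules, would mirror the encoder depth described above.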

Feedforward network. The feedforward network is composed of a layernorm layer, linear layers with 2048 hidden and 512 embedding dimensions with the swish activation function [67], and a dropout layer with a dropout rate of 0.1. Fig. 7 visualizes the architecture of this module.

Figure 7: Architecture of the feedforward network.

Multiheaded self attention. The multiheaded self attention with extrapolatable relative positional encoding (xPos) enhances the ability of the model to grasp the global acoustical information encapsulated in feature representations [68]. Fig. 8 shows the corresponding architecture. The xPos encoding strategy has been empirically validated to improve the stability and robustness of the self attention mechanism, particularly for sequences of varying length.

Figure 8: Architecture of the multiheaded self attention.

The xPos-based relative self attention can be formulated as follows:

$$\textbf{RelAttn}(\mathbcal{x})=\text{softmax}\Bigg(\frac{\boldsymbol{Q}_{\mathbcal{x},\rm{xPos}}\boldsymbol{K}_{\mathbcal{x},\rm{xPos}}^{T}}{\sqrt{\mathbcal{D}_h}}\boldsymbol{M}\Bigg)\boldsymbol{V}_{\mathbcal{x}} \tag{21}$$

where $\boldsymbol{Q}_{\mathbcal{x},\rm{xPos}}=(\textbf{W}_q\mathbcal{C}+\mathfrak{R}_{\boldsymbol{Q}}\mathbcal{S})\mathbcal{T}$, $\boldsymbol{K}_{\mathbcal{x},\rm{xPos}}=(\textbf{W}_k\mathbcal{C}+\mathfrak{R}_{\boldsymbol{K}}\mathbcal{S})\mathbcal{T}^{-1}$, and $\boldsymbol{V}_{\mathbcal{x}}=\textbf{W}_v\mathbcal{x}$. $\mathbcal{C}$ is equal to $\cos(m\vartheta_i)$ and $\mathbcal{S}$ is equal to $\sin(m\vartheta_i)$, which are the cosine and sine positions at the embedding dimension $i$ and the time slice $m$, respectively. $\mathfrak{R}$ corresponds to the rotary matrix of $\boldsymbol{Q}$ and $\boldsymbol{K}$. $\mathbcal{D}_h$ is the head dimension of the attention mechanism. "$T$" denotes transposition. $\mathbcal{T}=\varsigma_{m,i}$. The $\varsigma_i$ is given by:

$$\varsigma_i=\frac{i/\frac{\mathbcal{D}_h}{2}+\beta}{1+\beta}. \tag{22}$$

where $\beta$ is fixed at its optimal setting and $\vartheta_i=10000^{-2i/\mathbcal{D}_h}$. $\textbf{W}_q$, $\textbf{W}_k$, and $\textbf{W}_v$ are the trainable weight matrices of the query, key, and value of the attention mechanism, respectively, and $\boldsymbol{M}$ is likewise a trainable weight matrix.

Figure 9: Architecture of the convolutional network.

Convolutional network. The convolutional network captures the local acoustic features and reinforces the temporal causality of the feature representation. This module leverages a prenorm residual connection with gating mechanisms to distill the important acoustical characteristics via pointwise and depthwise convolutional layers and a gated linear unit (GLU) layer [69], as illustrated in Fig. 9.

III-D3 Regression Estimation of the Room Parameters

For the regression task of room parameter estimation, we employ a parametric predictor (PP) alone for $T_i$, $T_d$, $V$, and $D$, and utilize both the PP and an acoustical bias corrector (ABC) for $\theta$ and $\psi$.

Parametric Predictor. The PP employs several convolutional layers with ReLU activation functions to compose a nonlinear regression function, allowing us to utilize the encoded representations within the latent space to predict the physically meaningful room parameters. These parameters include the two parameters of the SSIR model $\hat{T}_i$ and $\hat{T}_d$, the room volume $\hat{V}$, the sound source distance $\hat{D}$, and the DOA of the sound source $\hat{\theta}$ and $\hat{\psi}$. The behavior of the predictor can be expressed as follows:

$$\boldsymbol{\gamma}=f_{\rm pred}(\mathbcal{y}_{enc}) \tag{23}$$

where $\boldsymbol{\gamma}$ denotes the room parameters output from the PP, which is a constant function along the time axis. During inference, we sum and average the values over the time axis to obtain a single predicted room parameter. The overall architecture of the PP is presented in Fig. 10.
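
The inference-time pooling described above amounts to a mean over frames. A minimal NumPy sketch is given below. The `n_valid` argument, which excludes padded frames from the average, is an assumption of this sketch and is not described in the text, and `pool_prediction` is a hypothetical name.

```python
import numpy as np


def pool_prediction(gamma, n_valid=None):
    """Average a per-frame predictor output of shape (frames,) over time.

    n_valid (an assumption of this sketch) limits the average to the
    first n_valid frames so that padding does not bias the estimate.
    """
    gamma = np.asarray(gamma, dtype=float)
    if n_valid is None:
        n_valid = gamma.shape[0]
    return float(gamma[:n_valid].sum() / n_valid)
```

For example, a per-frame estimate of $\hat{T}_d$ that hovers around one value across frames is reduced to a single scalar for that utterance.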

Figure 10: Architecture of the PP.

Acoustical Bias Corrector. The ABC acts as a gating mechanism to differentiate between the biased and unbiased data encoded within the latent space, thereby directing the optimal signal flow into the PP and ensuring that the PP can learn from the unbiased data distribution. Additionally, we leak some biased data into the PP to make it robust to biases. The necessity of such a mechanism arises in the context of room parameters such as the sound source azimuth $\theta$ and elevation $\psi$, whose distributions exhibit substantial inherent biases that are difficult to mitigate through conventional data augmentation techniques, as shown in Fig. 11. The biases often lead to the regression of trivial results, i.e., the mean of the whole distribution.

Figure 11: Histogram of DOAs of the sound source.

The ABC has a sandwich structure that uses rotary-positional self-attention [70] as a feature enhancer to assign different attention weights to the latent spectro-temporal feature representations in a frame-by-frame manner. The rotary position encoding also helps to stabilize the training process and adapts to variable latent input lengths without the need for alignment. The output bias probability $p_{\rm{abc}}(\mathbcal{\hat{y}})$ of the ABC can be expressed as follows:

$$p_{\rm{abc}}(\mathbcal{\hat{y}})=f_{\rm{corr}}(\mathbcal{y}_{enc}). \tag{24}$$
Figure 12: Architecture of the ABC.

Fig. 12 depicts the entire architecture of the ABC. “GELU” and “Sigmoid” denote the GELU [71] and sigmoid activation functions, respectively.

III-D4 Classification Estimation of the Room Parameters

An FC layer is designed to map the encoded feature representations to physics-informed instantaneous occupancy levels as a time sequence derived from the observed crowded signals. This design rests on Assumption 2 (Section III-A2), which allows the time-series forecasting problem to be recast from a regression task as a classification task, thus reducing the complexity of the task while improving the robustness of the prediction. Considering that the occupancy level exhibits no significant temporal dependence, we simply adopt a linear layer with a log-softmax activation function to predict the occupancy level rather than recurrent or autoregressive structures. The temporal resolution of the estimation process is about 62.5 Hz, i.e., 16 ms per frame, for predicting the occupancy level around the listener's location.
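
The frame-wise classification described above can be illustrated with a small sketch. It assumes that the linear layer has already produced one vector of class logits per 16 ms frame and that the classes are the occupancy levels 0 to 12 (the upper bound stated for the occupancy loss); the helper names and the logit values are hypothetical.

```python
import math
from typing import List


def log_softmax(logits: List[float]) -> List[float]:
    """Numerically stable log-softmax over one frame's class logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def predict_occupancy(frame_logits: List[List[float]]) -> List[int]:
    """Pick the most likely occupancy level independently for every frame."""
    levels = []
    for logits in frame_logits:
        log_probs = log_softmax(logits)
        levels.append(max(range(len(log_probs)), key=log_probs.__getitem__))
    return levels


NUM_CLASSES = 13  # occupancy levels 0..12 (assumed class layout)

# Two hypothetical frames of logits, standing in for the linear layer output.
frame_a = [0.0] * NUM_CLASSES
frame_a[3] = 2.0
frame_b = [0.0] * NUM_CLASSES
frame_b[7] = 1.5
levels = predict_occupancy([frame_a, frame_b])
```

Because each frame is classified independently, no recurrent state is carried between frames, which mirrors the stated absence of significant temporal dependence.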

III-E Joint Estimation Framework

We explore a joint estimation framework for estimating multiple RAPs and RPPs simultaneously. The underlying hypothesis posits that the RAPs, which are directly derived from the RIR, share the same reverberation information encapsulated in the RIR. Additionally, the observed reverberant signal embodies crucial physical information related to the room volume and sound-source characteristics. Furthermore, the interdependency between the RAPs and RPPs plays a pivotal role in improving the robustness and efficacy of the estimation, so the accuracy of the joint estimation strategy is expected to at least match that of the separate strategy. These hypotheses substantiate the feasibility of developing a universal model, which is a promising approach for efficiently analyzing room acoustics within a unified estimation framework instead of training multiple separate models.

The architecture of the joint estimation method is illustrated in Fig. 1. The unified and occupancy modules take the noisy reverberant signal $y_{\rm{nr}}$ and the crowded reverberant signal $y_{\rm{cr}}$ as inputs, respectively. Within the joint framework, the unified RFE serves as the foundational component across all PPs, facilitating the mutual exchange of interdependent information among the RAPs and RPPs in the latent space. Subsequently, each targeted room parameter is regressed by its own dedicated predictor. The configuration of each PP, as well as that of the ABC, is described in Section III-D3.

III-F Loss Function

III-F1 Loss for the Parametric Predictor

We employ the Huber loss [72] to optimize the PPs across each targeted room parameter. The Huber loss of the PPs is defined as:

$$\mathcal{L}_{\rm{pred}}(\gamma,\hat{\gamma})=\begin{cases}\dfrac{1}{2\mathcal{N}}\sum_{n=1}^{\mathcal{N}}\sum_{k=1}^{\mathcal{K}}(\gamma_{n}-\hat{\gamma}_{n,k})^{2}, & \lvert\gamma_{n}-\hat{\gamma}_{n,k}\rvert\leq\delta\\[2mm] \delta\dfrac{1}{\mathcal{N}}\sum_{n=1}^{\mathcal{N}}\sum_{k=1}^{\mathcal{K}}\lvert\gamma_{n}-\hat{\gamma}_{n,k}\rvert-\dfrac{1}{2}\delta^{2}, & \lvert\gamma_{n}-\hat{\gamma}_{n,k}\rvert>\delta\end{cases} \tag{25}$$

where $\gamma$ is the targeted room parameter and $\delta$ is set to 1. The symbol $\mathcal{N}$ denotes the batch size, and $\mathcal{K}$ represents the time frame length. The Huber loss combines the sensitivity of the minimum-variance estimation of the $\mathcal{L}_2$ loss with the robustness against outliers of the median-aware estimation of the $\mathcal{L}_1$ loss. It also circumvents the convergence problem of the $\mathcal{L}_1$ loss on a small scale [73] and contributes to preventing exploding gradients by clipping gradients exceeding $\delta$.
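
To make the two regimes of the Huber loss concrete, the following sketch evaluates it element-wise with $\delta = 1$. It is one possible reading of the definition above, assuming that the quadratic or linear branch is selected per frame-wise residual, that residuals are summed over frames, and that the result is averaged over the batch; the function names are hypothetical.

```python
from typing import List


def huber(residual: float, delta: float = 1.0) -> float:
    """Quadratic for small residuals, linear for residuals larger than delta."""
    a = abs(residual)
    if a <= delta:
        return 0.5 * residual * residual
    return delta * a - 0.5 * delta * delta


def huber_grad(residual: float, delta: float = 1.0) -> float:
    """Derivative with respect to the residual; its magnitude never exceeds delta."""
    if abs(residual) <= delta:
        return residual
    return delta if residual > 0 else -delta


def huber_loss(targets: List[float], frame_preds: List[List[float]], delta: float = 1.0) -> float:
    """Sum over frames k, mean over batch items n (assumed reduction)."""
    total = 0.0
    for gamma_n, preds_n in zip(targets, frame_preds):
        total += sum(huber(gamma_n - p, delta) for p in preds_n)
    return total / len(targets)


# One batch item with a small residual (-0.5) and an outlier residual (-3.0).
loss = huber_loss([1.0], [[1.5, 4.0]])
```

The outlier contributes 2.5 rather than the 4.5 it would contribute under a pure squared-error loss, and its gradient is capped at $\delta$.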

III-F2 Loss for the Acoustical Bias Corrector

Considering the prediction task of the ABC is binary, i.e., distinguishing between unbiased and biased data, we adopt the binary cross-entropy (BCE) for optimization, which is defined as follows:

$$\mathcal{L}_{\rm{corr}}=-\frac{1}{\mathcal{N}}\sum_{n=1}^{\mathcal{N}}\Big[y^{\rm{abc}}_{n}\cdot\log\big(p_{\rm{abc}}(\hat{y}_{n})\big)+(1-y^{\rm{abc}}_{n})\cdot\log\big(1-p_{\rm{abc}}(\hat{y}_{n})\big)\Big], \tag{26}$$

where $y^{\rm{abc}}_{n}$ denotes the ground-truth label indicating acoustical bias, and $p_{\rm{abc}}(\hat{y}_{n})$ denotes the predicted bias probability output from the ABC.
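
A direct transcription of the BCE definition is sketched below. The clamping constant `eps` is an addition for numerical safety and is not part of the formula; the labels and probabilities are made-up values for illustration.

```python
import math
from typing import List


def bce_loss(labels: List[int], probs: List[float], eps: float = 1e-12) -> float:
    """Binary cross-entropy averaged over the batch."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite (assumed safeguard)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(labels)


# Hypothetical batch: one biased sample (label 1) and one unbiased sample (label 0).
loss = bce_loss([1, 0], [0.9, 0.2])
```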

III-F3 Loss for the Occupancy Module

The occupancy module utilizes the cross-entropy (CE), reflecting the multiclass nature of the occupancy level estimation task. This loss function is determined as follows:

$$\mathcal{L}_{\rm{occu}}=-\frac{1}{\mathcal{N}}\sum_{n=1}^{\mathcal{N}}\sum_{c=0}^{\mathcal{C}}N_{n,c}\log\big(p_{\rm{crowd}}(\hat{y}^{\rm{crowd}}_{n,c})\big), \tag{27}$$

where $\mathcal{C}$, the upper bound of the occupancy level, is set to 12 according to Assumption 2 (detailed in Section III-A2). Here, $\hat{y}^{\rm{crowd}}_{n,c}$ represents the logits output from the FC layer, $p_{\rm{crowd}}$ denotes the output probability after the softmax function, and $N_{n,c}$ denotes the ground-truth instantaneous occupancy level.
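
The multiclass loss can be sketched as follows, assuming that the ground-truth term $N_{n,c}$ is a one-hot indicator of the true occupancy level, so that only the log-probability of the true class contributes. The function name and the logits are hypothetical.

```python
import math
from typing import List


def cross_entropy(targets: List[int], logits: List[List[float]]) -> float:
    """Mean negative log-probability of the true class, computed from raw logits."""
    total = 0.0
    for c_true, row in zip(targets, logits):
        m = max(row)
        log_z = m + math.log(sum(math.exp(x - m) for x in row))  # log-sum-exp
        total += row[c_true] - log_z
    return -total / len(targets)


NUM_CLASSES = 13  # levels 0..12, with the upper bound C = 12

# Uninformative logits give the chance-level loss log(13).
uniform = [[0.0] * NUM_CLASSES for _ in range(2)]
loss_uniform = cross_entropy([4, 9], uniform)

# A confident, correct prediction drives the loss toward zero.
confident = [0.0] * NUM_CLASSES
confident[4] = 10.0
loss_confident = cross_entropy([4], [confident])
```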

III-F4 See-Saw Loss

When estimating the sound source azimuth and elevation, the ABC is deployed to counteract the significant bias inherent in the data distribution. Nevertheless, a critical issue arises from the disparate gradient descent rates of the two employed loss functions (the Huber and BCE losses). The BCE loss descends significantly faster than the Huber loss, which causes training instability, specifically when the BCE loss approaches overfitting while the Huber loss is still underfitting.

Therefore, we introduce a new loss function, namely, the see-saw loss, to solve this disparity. This loss function can adaptively balance the gradient descent rates of BCE and Huber losses, thus stabilizing the training process. The see-saw loss function devised for DOA estimation is formulated as follows:

$$\mathcal{L}_{\rm{see-saw}}(\theta;\psi,\hat{\theta};\hat{\psi})=\mathfrak{w}^{\prime}_{\rm{corr}}(\mathcal{L}^{\rm{az}}_{\rm{corr}}+\mathcal{L}^{\rm{elev}}_{\rm{corr}})+\frac{\mathfrak{w}_{\rm{pred}}[\mathfrak{w}^{\prime}_{\rm{pred}}\mathcal{L}_{\rm{pred}}(\theta,\hat{\theta})+\mathfrak{w}^{\prime\prime}_{\rm{pred}}\mathcal{L}_{\rm{pred}}(\psi,\hat{\psi})]}{1+\mathfrak{w}^{\prime\prime}_{\rm{corr}}(\mathcal{L}^{\rm{az}}_{\rm{corr}}+\mathcal{L}^{\rm{elev}}_{\rm{corr}})}, \tag{28}$$

where $\mathcal{L}_{\rm{see-saw}}(\theta;\psi,\hat{\theta};\hat{\psi})$ denotes the total loss. The components $\mathcal{L}^{\rm{az}}_{\rm{corr}}$ and $\mathcal{L}^{\rm{elev}}_{\rm{corr}}$ represent the BCE losses of the azimuth and elevation through the ABC, respectively. $\mathcal{L}_{\rm{pred}}(\theta,\hat{\theta})$ and $\mathcal{L}_{\rm{pred}}(\psi,\hat{\psi})$ correspond to the Huber losses for the azimuth and elevation using the PPs. $\mathfrak{w}^{\prime}_{\rm{corr}}$ and $\mathfrak{w}^{\prime\prime}_{\rm{corr}}$ are the weight coefficients of the bias correctors. $\mathfrak{w}_{\rm{pred}}$, $\mathfrak{w}^{\prime}_{\rm{pred}}$, and $\mathfrak{w}^{\prime\prime}_{\rm{pred}}$ are the corresponding weight coefficients of the predictors.
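
The balancing behavior of the see-saw loss is easier to see numerically. The sketch below transcribes Eq. (28) for scalar loss values; the default weights follow the ratio reported for the joint framework and are treated here as absolute values, which is an assumption, as are the example loss values.

```python
def see_saw_loss(
    bce_az: float,
    bce_elev: float,
    huber_az: float,
    huber_elev: float,
    w_corr_1: float = 0.1,   # w'_corr, weight of the additive BCE term
    w_corr_2: float = 0.1,   # w''_corr, weight of the BCE term in the denominator
    w_pred: float = 0.5,     # w_pred, overall predictor weight
    w_pred_1: float = 10.0,  # w'_pred, azimuth Huber weight
    w_pred_2: float = 1.0,   # w''_pred, elevation Huber weight
) -> float:
    """See-saw loss: the Huber term is damped while the BCE term is still large."""
    corr = bce_az + bce_elev
    pred = w_pred * (w_pred_1 * huber_az + w_pred_2 * huber_elev)
    return w_corr_1 * corr + pred / (1.0 + w_corr_2 * corr)


# Early in training the BCE losses are large, so the Huber term is scaled down.
early = see_saw_loss(bce_az=0.7, bce_elev=0.7, huber_az=0.2, huber_elev=0.1)
# Once the BCE losses vanish, the Huber term receives its full weight.
late = see_saw_loss(bce_az=0.0, bce_elev=0.0, huber_az=0.2, huber_elev=0.1)
```

As the BCE losses shrink, the denominator approaches 1 and the relative share of the Huber term in the total grows, which is the see-saw effect the name suggests.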

III-F5 Polynomial See-Saw Loss for Joint Estimation

For the joint estimation framework, we introduce a loss function that combines a weighted polynomial sum of the Huber losses with the see-saw loss.

The polynomial see-saw loss unifiedsubscriptunified\mathcal{L}_{\rm{unified}}caligraphic_L start_POSTSUBSCRIPT roman_unified end_POSTSUBSCRIPT is formulated as follows:

$$\begin{aligned}\mathcal{L}_{\rm{unified}}(T_{i};T_{d};V;D;\theta;\psi,\hat{T}_{i};\hat{T}_{d};\hat{V};\hat{D};\hat{\theta};\hat{\psi})={}&\mathfrak{w}_{T_{i}}\mathcal{L}_{\rm{pred}}(T_{i},\hat{T}_{i})+\mathfrak{w}_{T_{d}}\mathcal{L}_{\rm{pred}}(T_{d},\hat{T}_{d})+\mathfrak{w}_{V}\mathcal{L}_{\rm{pred}}(V,\hat{V})\\&+\mathfrak{w}_{D}\mathcal{L}_{\rm{pred}}(D,\hat{D})+\mathcal{L}_{\rm{see-saw}}(\theta;\psi,\hat{\theta};\hat{\psi}),\end{aligned} \tag{29}$$

where $\mathfrak{w}_{T_{i}}$, $\mathfrak{w}_{T_{d}}$, $\mathfrak{w}_{V}$, and $\mathfrak{w}_{D}$ are the weighting coefficients for the losses $\mathcal{L}_{\rm{pred}}(T_{i},\hat{T}_{i})$, $\mathcal{L}_{\rm{pred}}(T_{d},\hat{T}_{d})$, $\mathcal{L}_{\rm{pred}}(V,\hat{V})$, and $\mathcal{L}_{\rm{pred}}(D,\hat{D})$, respectively. The weighting ratio is arranged as $\mathfrak{w}_{T_{i}}:\mathfrak{w}_{T_{d}}:\mathfrak{w}_{V}:\mathfrak{w}_{D}:\mathfrak{w}^{\prime}_{\rm{corr}}:\mathfrak{w}^{\prime\prime}_{\rm{corr}}:\mathfrak{w}_{\rm{pred}}:\mathfrak{w}^{\prime}_{\rm{pred}}:\mathfrak{w}^{\prime\prime}_{\rm{pred}}=5.0:5.0:5.0:5.0:0.1:0.1:0.5:10.0:1.0$.
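
As a small worked example of Eq. (29), the sketch below adds the weighted Huber losses of the four non-DOA parameters to a precomputed see-saw loss value. The weights use the stated ratio as absolute values (an assumption), and the individual loss values are invented for illustration.

```python
from typing import Dict, Optional


def polynomial_see_saw_loss(
    huber_losses: Dict[str, float],
    see_saw: float,
    weights: Optional[Dict[str, float]] = None,
) -> float:
    """Weighted sum of per-parameter Huber losses plus the see-saw loss for the DOA."""
    if weights is None:
        weights = {"T_i": 5.0, "T_d": 5.0, "V": 5.0, "D": 5.0}
    weighted = sum(weights[name] * huber_losses[name] for name in weights)
    return weighted + see_saw


# Hypothetical per-parameter Huber losses and a hypothetical see-saw loss value.
losses = {"T_i": 0.01, "T_d": 0.02, "V": 0.03, "D": 0.04}
total = polynomial_see_saw_loss(losses, see_saw=1.05)
```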

III-G Evaluation Metrics

We employ the mean absolute error (MAE) and the Pearson correlation coefficient (PCC) as the evaluation metrics. The MAE provides a direct measure of the scale of the average estimation error, whereas the PCC quantifies the linear agreement between the estimated and ground-truth values in a manner that is invariant to their offset and scale.
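
The two metrics capture different failure modes, which the following sketch illustrates with made-up values: an estimator that is perfectly correlated with the ground truth but systematically offset attains a PCC of 1 while still showing a large MAE. The function names are hypothetical.

```python
import math
from typing import Sequence


def mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def pcc(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Pearson correlation coefficient."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    if var_t == 0.0 or var_p == 0.0:
        raise ValueError("PCC is undefined for constant sequences")
    return cov / math.sqrt(var_t * var_p)


truth = [1.0, 2.0, 3.0, 4.0]
estimate = [2.0 * t + 1.0 for t in truth]  # perfectly correlated but biased

error = mae(truth, estimate)
corr = pcc(truth, estimate)
```

Note that the PCC is undefined when an estimator outputs a constant, such as the mean of the distribution, which is why the sketch raises an error in that case.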

TABLE I: MAE and PCC comparisons among variants of the proposed BERP with different featurizations and the baselines for the room parameters derived from the estimation frameworks, i.e., the neural network models. All models were trained until convergence. Gammatone, Mel, and MFCC denote the featurization methods.

| Metric | Model (joint) | $T_i$ [s] | $T_d$ [s] | $V$ [$\log_{10}(m^3)$] | $D$ [m] | $\theta$ [rad] | $\psi$ [rad] | $N$ |
|---|---|---|---|---|---|---|---|---|
| MAE ↓ | Full-CNN [32, 33, 34] | 1.7720 | 1.0050 | 1.3210 | 3.8260 | 3.2610 | 0.5966 | - |
| MAE ↓ | CRNN [35, 36] | 0.0238 | 0.2968 | 0.3189 | 1.8730 | 0.3685 | 0.1002 | - |
| MAE ↓ | TAE-CNN [39] | 0.0404 | 0.4608 | 0.4328 | 3.8700 | 0.3157 | 0.0684 | - |
| MAE ↓ | RE-NET [38] | 0.1844 | 1.1460 | 0.5102 | 4.6180 | 0.8614 | 0.8183 | - |
| MAE ↓ | BERP-Gammatone | 0.0030 | 0.0341 | 0.0373 | 0.6918 | 0.2451 | 0.0684 | 0.5519 |
| MAE ↓ | BERP-Mel | 0.0018 | 0.0221 | 0.0272 | 0.5070 | 0.1967 | 0.0695 | 0.5370 |
| MAE ↓ | BERP-MFCC | 0.0019 | 0.0264 | 0.0271 | 0.5375 | 0.1899 | 0.0734 | 0.5411 |
| PCC ↑ | Full-CNN [32, 33, 34] | 0.1543 | 0.6431 | 0.3268 | 0.5731 | 0.0329 | 0.0116 | - |
| PCC ↑ | CRNN [35, 36] | 0.6914 | 0.9356 | 0.7450 | 0.8512 | 0.1157 | 0.1756 | - |
| PCC ↑ | TAE-CNN [39] | - | - | 0.5555 | 0.5194 | - | - | - |
| PCC ↑ | RE-NET [38] | 0.2395 | 0.1579 | 0.3961 | 0.3285 | 0.0629 | 0.1579 | - |
| PCC ↑ | BERP-Gammatone | 0.9437 | 0.9929 | 0.9731 | 0.9271 | 0.6311 | 0.6936 | - |
| PCC ↑ | BERP-Mel | 0.9691 | 0.9971 | 0.9705 | 0.9503 | 0.7017 | 0.7342 | - |
| PCC ↑ | BERP-MFCC | 0.9667 | 0.9951 | 0.9733 | 0.9520 | 0.7325 | 0.7342 | - |

TABLE II: Evaluation results obtained for the RAPs derived from the SSIR RIR model of the proposed BERP.

| Metric | Model (joint) | STI | $\%\rm{AL}_{cons}$ [%] | $T_{60}$ [s] | EDT [s] | $C_{80}$ [dB] | $C_{50}$ [dB] | $D_{50}$ [%] | $T_s$ [s] |
|---|---|---|---|---|---|---|---|---|---|
| MAE ↓ | BERP-Gammatone | 0.0544 | 4.1388 | 0.0342 | 0.3378 | 2.966 | 3.3556 | 14.7659 | 0.0539 |
| MAE ↓ | BERP-Mel | 0.0534 | 4.0794 | 0.0221 | 0.3282 | 2.9051 | 3.3135 | 14.5699 | 0.0528 |
| MAE ↓ | BERP-MFCC | 0.0540 | 4.0877 | 0.0265 | 0.3325 | 2.9498 | 3.3418 | 14.6950 | 0.0532 |
| PCC ↑ | BERP-Gammatone | 0.9477 | 0.8660 | 0.9976 | 0.9870 | 0.9047 | 0.8370 | 0.8221 | 0.9772 |
| PCC ↑ | BERP-Mel | 0.9501 | 0.8682 | 0.9994 | 0.9892 | 0.9097 | 0.8412 | 0.8263 | 0.9802 |
| PCC ↑ | BERP-MFCC | 0.9490 | 0.8671 | 0.9960 | 0.9864 | 0.9082 | 0.8397 | 0.8251 | 0.9777 |

TABLE III: Results of an ablation study concerning separate estimation pipelines. The MAE and PCC attained by the proposed BERP and the baselines in separate estimation pipelines are presented. All the models were trained until convergence. We used the MFCC featurization method for the BERP.

| Metric | Model (separate) | $T_i$ [s] | $T_d$ [s] | $V$ [$\log_{10}(m^3)$] | $D$ [m] | $\theta$ [rad] | $\psi$ [rad] |
|---|---|---|---|---|---|---|---|
| MAE ↓ | Full-CNN [32, 33, 34] | 0.0704 | 0.3085 | 0.5282 | 4.8520 | 0.3139 | 0.0702 |
| MAE ↓ | CRNN [35, 36] | 0.0177 | 0.2927 | 0.1597 | 1.6540 | 0.2583 | 0.0701 |
| MAE ↓ | TAE-CNN [39] | 0.0341 | 1.7320 | 3.1200 | 7.704 | 0.3157 | 0.0683 |
| MAE ↓ | RE-NET [38] | 0.0341 | 0.6283 | 0.4963 | 5.2390 | 0.3140 | 0.0733 |
| MAE ↓ | BERP | 0.0025 | 0.0322 | 0.0382 | 0.6413 | 0.2074 | 0.0569 |
| PCC ↑ | Full-CNN [32, 33, 34] | 0.2660 | 0.9377 | - | - | - | - |
| PCC ↑ | CRNN [35, 36] | 0.5202 | 0.9481 | 0.9221 | 0.8859 | 0.3397 | 0.2612 |
| PCC ↑ | TAE-CNN [39] | - | - | - | - | - | - |
| PCC ↑ | RE-NET [38] | 0.1159 | 0.6722 | 0.3293 | 0.1243 | 0.0341 | 0.0296 |
| PCC ↑ | BERP | 0.9597 | 0.9976 | 0.9641 | 0.9336 | 0.6173 | 0.6595 |

When estimating the occupancy levels as a time sequence, we use only the MAE, which quantifies the average absolute deviation between the estimated and ground-truth occupancy sequences.

IV Experiments

IV-A Experimental Setup

Training strategy. We randomly split the 47,430 distinct clips of noisy and crowded reverberant speech signals into training, validation, and test datasets, following the I.I.D. paradigm. We allocate 2000 clips each to the validation and test datasets, and the remaining 43,430 clips form the training dataset. A padding mask is deployed to ensure that the framework learns only the valid information across each minibatch. The RAdam optimizer [74] with $\mathcal{L}_2$ regularization is used, which provides learning rate warmup without the risk of underfitting the regression tasks. We utilize cosine-annealing and tri-stage learning rate schedulers for the unified and occupancy modules, respectively, to facilitate the convergence of the models toward the global optima. We set a batch size of 12. Given the wide range of room volumes, spanning from 40 to 9000 $m^{3}$, we apply logarithmic scaling to compress them, stabilizing the training process and improving model robustness. Unitary linear normalization is applied to the labels to standardize the gradient update rate and ensure a uniform descent across labels.
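
The split sizes and the volume scaling described above can be sketched as follows. This is a minimal illustration, assuming that "unitary linear normalization" means min-max scaling to the interval [0, 1] and that it is applied after the base-10 logarithm; the seed, the function names, and the exact order of operations are assumptions rather than details taken from the released code.

```python
import math
import random
from typing import List, Tuple


def split_indices(n_total: int, n_val: int, n_test: int, seed: int = 0) -> Tuple[List[int], List[int], List[int]]:
    """Randomly partition clip indices into train, validation, and test sets."""
    idx = list(range(n_total))
    random.Random(seed).shuffle(idx)
    val = idx[:n_val]
    test = idx[n_val:n_val + n_test]
    train = idx[n_val + n_test:]
    return train, val, test


def scale_volume(volume_m3: float, v_min: float = 40.0, v_max: float = 9000.0) -> float:
    """Log-compress a room volume, then map it linearly to [0, 1] (assumed scheme)."""
    lo, hi = math.log10(v_min), math.log10(v_max)
    return (math.log10(volume_m3) - lo) / (hi - lo)


train, val, test = split_indices(47430, 2000, 2000)
mid_room = scale_volume(600.0)  # geometric mean of 40 and 9000, so it maps to 0.5
```

On a linear scale a 600 $m^3$ room would sit near the bottom of the range, whereas after logarithmic compression it lands at the midpoint, which illustrates why the compression evens out the label distribution.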

TABLE IV: Results of an ablation study concerning the removal of the PP. The PP is replaced with a simple linear layer to investigate its contribution. The MFCC featurization is employed across all evaluations.

| Setting | Metric | Architecture | $T_i$ | $T_d$ | $V$ | $D$ |
|---|---|---|---|---|---|---|
| (separate) | MAE ↓ | BERP w/o PP | 0.1620 | 1.0023 | 0.0508 | 0.5960 |
| (separate) | MAE ↓ | BERP | 0.0025 | 0.0322 | 0.0382 | 0.6413 |
| (separate) | PCC ↑ | BERP w/o PP | 0.6013 | 0.9211 | 0.9554 | 0.9343 |
| (separate) | PCC ↑ | BERP | 0.9579 | 0.9948 | 0.9641 | 0.9336 |
| (joint) | MAE ↓ | BERP w/o PP | 0.1174 | 1.1707 | 0.4173 | 8.4852 |
| (joint) | MAE ↓ | BERP | 0.0019 | 0.0264 | 0.0271 | 0.5375 |
| (joint) | PCC ↑ | BERP w/o PP | 0.6219 | 0.7821 | 0.7626 | 0.6724 |
| (joint) | PCC ↑ | BERP | 0.9667 | 0.9951 | 0.9733 | 0.9520 |

Featurizer configuration. We set a uniform configuration for all spectrogram-variant featurizers. Each uses 128 channels (Gammatone filterbank channels, mel filterbank channels, or DCT bins, respectively), a window size of 1024, and a 75% overlap rate.
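
The framing implied by this configuration can be checked with a few lines of arithmetic. The sketch assumes that the window size is given in samples and that the sampling rate is 16 kHz; the sampling rate is not stated in this section, but it is the value that reproduces the 16 ms frame period quoted for the occupancy module. The frame count further assumes no padding at the signal edges.

```python
def hop_length(window: int, overlap: float) -> int:
    """Hop size in samples for a given window size and fractional overlap."""
    return int(round(window * (1.0 - overlap)))


def num_frames(n_samples: int, window: int, hop: int) -> int:
    """Number of full frames without edge padding (assumed framing convention)."""
    if n_samples < window:
        return 0
    return 1 + (n_samples - window) // hop


SAMPLE_RATE_HZ = 16000  # assumption, not stated in this section

hop = hop_length(window=1024, overlap=0.75)
frame_period_ms = 1000.0 * hop / SAMPLE_RATE_HZ
frame_rate_hz = SAMPLE_RATE_HZ / hop
frames_in_one_second = num_frames(SAMPLE_RATE_HZ, 1024, hop)
```

Under these assumptions the hop is 256 samples, which corresponds to a 16 ms frame period and a 62.5 Hz frame rate.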

Baselines. In our comparative experiments, we evaluated the performance of our proposed method in comparison with four baseline architectures that are renowned in the domain of room parameter estimation amidst background noise: the Full-CNN [32, 33, 34], the CRNN [35, 36], the TAE-CNN [39], and the RE-NET [38]. These SOTA frameworks were deployed in both joint and separate estimation tasks.

IV-B Results

IV-B1 Evaluation of the Room Parameters Derived from Frameworks

We evaluated the proposed BERP and the baseline frameworks by using the same dataset as detailed in Section III-C1 and the same data segmentation setting for the joint estimation of the room parameters $T_{i}$, $T_{d}$, $V$, $D$, $\theta$, $\psi$, and $N$ output from the trained models.

Table I shows the estimation accuracies achieved by the BERP across three featurizations, alongside a comparison with the baselines. The BERP significantly outperforms the SOTA architectures in terms of the MAE and PCC evaluation metrics. Even for parameters such as the azimuth $\theta$ and elevation $\psi$, which are subject to significant data distribution biases, the BERP maintains its effectiveness. Moreover, the performance comparison among the three featurizers indicates that the MFCC featurizer yields the most favorable outcomes, which supports our assertion regarding the intrinsic relevance of MFCC to room acoustics and highlights its suitability for the blind estimation of room parameters.

IV-B2 Evaluation of the Room Acoustic Parameters Using the SSIR Model

Table II shows the estimation results obtained for the RAPs derived from the RIRs synthesized using the SSIR RIR model. These results indicate the effectiveness of the SSIR model for modeling realistic RIRs and subsequently deriving RAPs, highlighting its ability to capture the essence of real-world RIRs for the precise estimation of RAPs. Specifically, the BERP with the mel spectrogram featurizer achieves the best MAE and PCC for every RAP in Table II.
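
To illustrate how RAPs follow from an RIR, the sketch below computes the clarity ($C_{50}$, $C_{80}$), definition ($D_{50}$), and center time ($T_s$) from their standard energy-ratio definitions. The impulse response used here is a noiseless exponential decay with a nominal $T_{60}$ of 0.5 s, chosen only as a stand-in; it is not the SSIR model, and the sampling rate and function names are assumptions.

```python
import math
from typing import List, Tuple


def energy_split(rir: List[float], fs: int, split_ms: float) -> Tuple[float, float]:
    """Energy before and after a time boundary measured from the start of the RIR."""
    split = int(round(fs * split_ms / 1000.0))
    early = sum(h * h for h in rir[:split])
    late = sum(h * h for h in rir[split:])
    return early, late


def clarity_db(rir: List[float], fs: int, split_ms: float) -> float:
    """Clarity index in dB: early-to-late energy ratio (C50 or C80)."""
    early, late = energy_split(rir, fs, split_ms)
    return 10.0 * math.log10(early / late)


def definition_percent(rir: List[float], fs: int, split_ms: float = 50.0) -> float:
    """Definition D50 in percent: early energy over total energy."""
    early, late = energy_split(rir, fs, split_ms)
    return 100.0 * early / (early + late)


def center_time_s(rir: List[float], fs: int) -> float:
    """Center time in seconds: energy-weighted mean arrival time."""
    energy = [h * h for h in rir]
    return sum(n * e for n, e in enumerate(energy)) / (fs * sum(energy))


fs = 16000                      # assumed sampling rate
t60 = 0.5                       # nominal reverberation time of the stand-in RIR
decay = math.log(1000.0) / t60  # amplitude falls by 60 dB at t = T60
rir = [math.exp(-decay * n / fs) for n in range(fs)]  # 1 s exponential envelope

c50 = clarity_db(rir, fs, 50.0)
c80 = clarity_db(rir, fs, 80.0)
d50 = definition_percent(rir, fs)
ts = center_time_s(rir, fs)
```

For this idealized decay the energy drops by 6 dB within the first 50 ms, so roughly three quarters of the energy counts as early; all four quantities are obtained from the same RIR in a single pass, without any additional model.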

TABLE V: Results of an ablation study concerning the efficacy of the ABC. We compare BERPs with and without the ABC, which is used in the orientation module only. The featurization method is MFCC. The ABC significantly improves the regression of the azimuth and elevation parameters, which have inherent distribution biases.

| Metric | Orientation module | $\theta$ | $\psi$ |
|---|---|---|---|
| PCC ↑ | BERP w/o ABC | 0.2724 | 0.6099 |
| PCC ↑ | BERP | 0.6173 | 0.6595 |
| MAE ↓ | BERP w/o ABC | 0.2823 | 0.0574 |
| MAE ↓ | BERP | 0.2268 | 0.0574 |

IV-B3 Ablation Study

Separate estimation pipelines. To further investigate the efficacy of our specifically designed RFE, PP, and ABC for estimating the RAPs and RPPs, we conducted separate estimation tasks. This study also serves to verify the hypothesis stated in Section III-E that the unified encoder promotes the effectiveness of the estimation. It comprises four separate pipelines, namely the RIR, volume, distance, and orientation modules, each of which is dedicated to mapping the observed speech signals to the corresponding target RAPs and RPPs. The RIR module concurrently estimates the two parameters $T_{i}$ and $T_{d}$ of the SSIR model. The volume and distance modules estimate the room volume $V$ and the sound source distance $D$, respectively. Finally, the orientation module simultaneously estimates the DOA of the sound source, i.e., $\theta$ and $\psi$. The results in Table III show that the BERP also significantly outperforms the current methods, indicating that the proposed framework remains effective even in separate estimation. They also substantiate our hypothesis that joint estimation enhances the estimation accuracy via the mutual interdependence of room parameters and facilitates sufficient and efficient learning for the neural networks.

Without the parametric predictor. To isolate the contribution of the PP to the overall performance of the BERP, we conducted an ablation study in which the PP is removed. We employed separate pipelines for three modules: the RIR, volume, and distance modules. The joint framework was also tested, with DOA estimation discarded because the orientation module integrates the ABC. We consistently used the MFCC featurizer. Table IV compares the results obtained using solely the RFE with those achieved by the full architecture equipped with the PP. The results show that the PP contributes significantly to the performance of the BERP, especially in joint estimation.

Without the acoustical bias corrector. To understand the efficacy of the ABC, we conducted an ablation study with and without this bias corrector for estimating the sound source azimuth $\theta$ and elevation $\psi$. Importantly, the PCC is much more representative than the MAE for evaluating the performance achieved on datasets with biased data distributions. Table V indicates that the ABC significantly mitigates the intrinsic bias within the dataset, demonstrating its efficacy for substantially biased data distributions.
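The point about PCC versus MAE can be illustrated with a minimal sketch. It is not taken from the BERP code base: the target values, the two toy predictors, and the constant `OFFSET` below are hypothetical and chosen only so that both predictors reach the same MAE. One predictor collapses toward the dominant target value and ignores the rare extremes, while the other follows the targets with a constant bias. The MAE cannot tell them apart, but the PCC can.

```python
import math


def mean_absolute_error(targets, predictions):
    """Average absolute deviation between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(targets, predictions)) / len(targets)


def pearson_correlation(targets, predictions):
    """Pearson correlation coefficient (PCC) of two equal-length sequences."""
    n = len(targets)
    mean_t = sum(targets) / n
    mean_p = sum(predictions) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(targets, predictions))
    var_t = sum((t - mean_t) ** 2 for t in targets)
    var_p = sum((p - mean_p) ** 2 for p in predictions)
    if var_t == 0.0 or var_p == 0.0:
        raise ValueError("PCC is undefined for a constant sequence")
    return cov / math.sqrt(var_t * var_p)


# Hypothetical, biased targets: eight values clustered near 0.5, two rare extremes.
targets = [0.50, 0.52, 0.48, 0.51, 0.49, 0.50, 0.53, 0.47, 0.10, 0.90]

# Toy predictor 1: collapses toward the dominant value and ignores the extremes.
collapsed = [0.51, 0.49, 0.51, 0.49, 0.51, 0.49, 0.51, 0.49, 0.50, 0.50]

# Toy predictor 2: follows every target but with a constant bias.
# OFFSET is an assumption picked so that both predictors have the same MAE.
OFFSET = 0.096
tracking = [t + OFFSET for t in targets]

mae_collapsed = mean_absolute_error(targets, collapsed)
mae_tracking = mean_absolute_error(targets, tracking)
pcc_collapsed = pearson_correlation(targets, collapsed)
pcc_tracking = pearson_correlation(targets, tracking)

print(f"collapsed: MAE={mae_collapsed:.4f}, PCC={pcc_collapsed:.4f}")
print(f"tracking:  MAE={mae_tracking:.4f}, PCC={pcc_tracking:.4f}")
```

In this toy setting the two MAE values coincide by construction, whereas the PCC is close to zero for the collapsed predictor and close to one for the tracking predictor. The $\psi$ column of Table V shows a similar pattern, with an unchanged MAE alongside a higher PCC.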

V Conclusion

We propose the BERP, a universal blind estimation framework designed for simultaneously estimating several RAPs and RPPs, i.e., the speech transmission index (STI), articulation loss of consonants ($\%\mathrm{AL}_{\mathrm{cons}}$), reverberation time ($T_{60}$), early decay time (EDT), clarity ($C_{80}$ and $C_{50}$), definition ($D_{50}$), center time ($T_s$), room volume ($V$), sound source distance ($D$), DOA of the sound source ($\theta$ and $\psi$), and instantaneous occupancy level ($N$). The BERP provides a new paradigm for blind estimation in room acoustics. This framework can blindly evaluate the RAPs and RPPs simultaneously within a wide range of realistic acoustical environments to parameterize the listener’s local RACs, which promises a wide variety of applications in room acoustics, hearing aids, communications, and human-machine interactions [14, 16, 17, 21, 22, 24, 19, 25, 15, 5, 23, 13, 75, 26]. We incorporate a new stochastic RIR model, namely, the SSIR model, to realize the concurrent and efficient estimation of RAPs without increasing the computational complexity of the framework. This scheme avoids complicated optimization processes across the widely differing value ranges of the different RAPs. Moreover, the BERP fills a gap in the domain, i.e., the lack of a universal framework for blindly estimating these room parameters from single-channel noisy speech signals, especially the sound source distance, DOA of the sound source, and instantaneous occupancy level. The evaluation results show that the proposed BERP framework greatly outperforms the current methods and achieves SOTA performance by simultaneously estimating thirteen room-acoustics-related parameters for the first time.

Regarding the limitations of this study, except for occupancy level estimation, the BERP assumes a dynamic-movement, single-source speech signal as the observed input. Future research will aim to address the blind estimation of RAPs and RPPs for multisource speech signals by developing a unified signal model that can accommodate both noisy and crowded reverberant signals in real-world environments. This extension will further expand the applicability of the proposed framework to more complex realistic acoustic scenarios.

VI Acknowledgements

We appreciate the great help from and beneficial discussions with **an Chen for this work.

References

  • [1] M. Barron, Auditorium Acoustics and Architectural Design (2nd ed.). London: Routledge, 2009.
  • [2] A. Tsilfidis, I. Mporas, J. Mourjopoulos, and N. Fakotakis, “Automatic speech recognition performance in different room acoustic environments with and without dereverberation preprocessing,” Computer Speech & Language, vol. 27, no. 1, pp. 380–395, 2013. Special issue on Paralinguistics in Naturalistic Speech and Language.
  • [3] T. Jenrungrot, V. Jayaram, S. Seitz, and I. Kemelmacher-Shlizerman, “The cone of silence: Speech separation by localization,” in Advances in Neural Information Processing Systems, 2020.
  • [4] S. E. Chazan, H. Hammer, G. Hazan, J. Goldberger, and S. Gannot, “Multi-microphone speaker separation based on deep doa estimation,” 2019 27th European Signal Processing Conference (EUSIPCO), pp. 1–5, 2019.
  • [5] J.-M. Jot and K. S. Lee, “Augmented reality headphone environment rendering,” in Audio Engineering Society Conference: 2016 AES International Conference on Audio for Virtual and Augmented Reality, Sep 2016.
  • [6] J. van der Werff and D. de Leeuw, “What you specify is what you get (part 1),” in Audio Engineering Society Convention 114, Mar 2003.
  • [7] S. V. Amengual Garí, W. Lachenmayr, and E. Mommertz, “Spatial analysis and auralization of room acoustics using a tetrahedral microphone,” The Journal of the Acoustical Society of America, vol. 141, pp. EL369–EL374, 04 2017.
  • [8] C. Chen, U. Jain, C. Schissler, S. V. A. Gari, Z. Al-Halah, V. K. Ithapu, P. Robinson, and K. Grauman, “Soundspaces: Audio-visual navigation in 3d environments,” in Computer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part VI, (Berlin, Heidelberg), pp. 17–36, Springer-Verlag, 2020.
  • [9] A. Xenaki, J. Bünsow Boldt, and M. Græsbøll Christensen, “Sound source localization and speech enhancement with sparse Bayesian learning beamforming,” The Journal of the Acoustical Society of America, vol. 143, pp. 3912–3921, 06 2018.
  • [10] IEC 60268-16:2020, Sound system equipment - part 16: Objective rating of speech intelligibility by speech transmission index. 2020.
  • [11] V. M. A. Peutz and W. Kelin, “Articulation loss of consonants influenced by noise, reverberation and echo,” (in Dutch), Acoust. Soc. Netherlands, vol. 28, pp. 11–18.
  • [12] ISO 3382:2009, Acoustics - measurements of room acoustics parameters - part 1: Performance spaces. 2009.
  • [13] K. Kinoshita, M. Delcroix, T. Yoshioka, T. Nakatani, E. Habets, R. Haeb-Umbach, V. Leutnant, A. Sehr, W. Kellermann, R. Maas, S. Gannot, and B. Raj, “The reverb challenge: A common evaluation framework for dereverberation and recognition of reverberant speech,” in 2013 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, pp. 1–4, 2013.
  • [14] L. Frenkel, S. E. Chazan, and J. Goldberger, “Domain adaptation using suitable pseudo labels for speech enhancement and dereverberation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 32, pp. 1226–1236, 2024.
  • [15] H. Morgenstern and B. Rafaely, “Spatial reverberation and dereverberation using an acoustic multiple-input multiple-output system,” Journal of the Audio Engineering Society, vol. 65, pp. 42–55, Feb. 2017.
  • [16] T. Gajecki and W. Nogueira, “A fused deep denoising sound coding strategy for bilateral cochlear implants,” IEEE Transactions on Biomedical Engineering, pp. 1–11, 2024.
  • [17] E. P. Reynders, J. Van den Wyngaert, M. Verlinden, and G. Vermeir, “Development and performance assessment of sound absorbing chandeliers for reverberation control and improved verbal communication in large rooms,” Applied Acoustics, vol. 218, p. 109874, 2024.
  • [18] D. Fogerty, A. Alghamdi, and W.-Y. Chan, “The effect of simulated room acoustic parameters on the intelligibility and perceived reverberation of monosyllabic words and sentences,” The Journal of the Acoustical Society of America, vol. 147, pp. EL396–EL402, 05 2020.
  • [19] B. Eurich, T. Klenzner, and M. Oehler, “Impact of room acoustic parameters on speech and music perception among participants with cochlear implants,” Hearing Research, vol. 377, pp. 122–132, 2019.
  • [20] H.-Y. Lee, J.-W. Cho, M. Kim, and H.-M. Park, “Dnn-based feature enhancement using doa-constrained ica for robust speech recognition,” IEEE Signal Processing Letters, vol. 23, no. 8, pp. 1091–1095, 2016.
  • [21] G. Yenduri, R. M, P. K. R. Maddikunta, T. R. Gadekallu, R. H. Jhaveri, A. Bandi, J. Chen, W. Wang, A. A. Shirawalmath, R. Ravishankar, and W. Wang, “Spatial computing: Concept, applications, challenges and future directions,” 2024.
  • [22] H. M. Kamdjou, D. Baudry, V. Havard, and S. Ouchani, “Resource-constrained extended reality operated with digital twin in industrial internet of things,” IEEE Open Journal of the Communications Society, vol. 5, pp. 928–950, 2024.
  • [23] J. Nikunen and T. Virtanen, “Direction of arrival based spatial covariance model for blind sound source separation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 3, pp. 727–739, 2014.
  • [24] A. Taghipour, S. Athari, A. Gisladottir, T. Sievers, and K. Eggenschwiler, “Room acoustical parameters as predictors of acoustic comfort in outdoor spaces of housing complexes,” Frontiers in Psychology, vol. 11, p. 344, 03 2020.
  • [25] H. Dong and C. Lee, “Speech intelligibility improvement in noisy reverberant environments based on speech enhancement and inverse filtering,” J AUDIO SPEECH MUSIC PROC., vol. 3, 2018.
  • [26] X. Li, L. Girin, F. Badeig, and R. Horaud, “Reverberant sound localization with a robot head based on direct-path relative transfer function,” in 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 2819–2826, IEEE Press, 2016.
  • [27] H. Kuttruff, Room Acoustics. Taylor & Francis, 2016.
  • [28] L. Wang, S. Duangpummet, and M. Unoki, “Blind estimation of speech transmission index and room acoustic parameters by using extended model of room impulse response derived from speech signals,” IEEE Access, vol. 11, pp. 49431–49444, 2023.
  • [29] S. S. Kushwaha, I. R. Roman, M. Fuentes, and J. P. Bello, “Sound source distance estimation in diverse and dynamic acoustic conditions,” in 2023 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 1–5, 2023.
  • [30] P.-A. Grumiaux, S. Kitić, L. Girin, and A. Guérin, “A survey of sound source localization with deep learning methods,” The Journal of the Acoustical Society of America, vol. 152, pp. 107–151, 07 2022.
  • [31] C. Molnar and T. Freiesleben, Supervised Machine Learning For Science. 2024.
  • [32] C. Ick, A. Mehrabi, and W. **, “Blind acoustic room parameter estimation using phase features,” in ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1–5, 2023.
  • [33] A. F. Genovese, H. Gamper, V. Pulkki, N. Raghuvanshi, and I. J. Tashev, “Blind room volume estimation from single-channel noisy speech,” in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 231–235, 2019.
  • [34] H. Gamper and I. J. Tashev, “Blind reverberation time estimation using a convolutional neural network,” in 2018 16th International Workshop on Acoustic Signal Enhancement (IWAENC), pp. 136–140, 2018.
  • [35] P. S. López, P. Callens, and M. Cernak, “A universal deep room acoustics estimator,” in 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 356–360, 2021.
  • [36] P. Callens and M. Cernak, “Joint blind room acoustic characterization from speech and music signals using convolutional recurrent neural networks,” 2020.
  • [37] J. Eaton, N. Gaubitch, A. Moore, and P. Naylor, “Estimation of room acoustic parameters: The ACE challenge,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, pp. 1–1, 06 2016.
  • [38] K. Zheng, C. Zheng, J. Sang, Y. Zhang, and X. Li, “Noise-robust blind reverberation time estimation using noise-aware time–frequency masking,” Measurement, vol. 192, p. 110901, 2022.
  • [39] S. Duangpummet, J. Karnjana, W. Kongprawechnon, and M. Unoki, “Blind estimation of speech transmission index and room acoustic parameters based on the extended model of room impulse response,” Applied Acoustics, vol. 185, p. 108372, 2022.
  • [40] J. Traer and J. H. McDermott, “Statistics of natural reverberation enable perceptual separation of sound and space,” Proceedings of the National Academy of Sciences, vol. 113, no. 48, pp. E7856–E7865, 2016.
  • [41] C. Christensen, G. Koutsouris, and J. Rindel, “The iso 3382 parameters: Can we simulate them? can we measure them?,” vol. 20, 06 2013.
  • [42] R. Kliper, H. Kayser, D. Weinshall, I. Nelken, and J. Anemüller, “Monaural azimuth localization using spectral dynamics of speech,” in Proc. Interspeech 2011, pp. 33–36, 2011.
  • [43] R. Takashima, T. Takiguchi, and Y. Ariki, “Single-channel multi-talker-localization based on maximum likelihood,” in 2009 IEEE/SP 15th Workshop on Statistical Signal Processing, pp. 461–464, 2009.
  • [44] F. Toole, Sound Reproduction: The Acoustics and Psychoacoustics of Loudspeakers and Rooms. Audio Engineering Society Presents, Taylor & Francis, 2017.
  • [45] S. Cerdá, A. Giménez, J. Romero, R. Cibrián, and J. Miralles, “Room acoustical parameters: A factor analysis approach,” Applied Acoustics, vol. 70, no. 1, pp. 97–109, 2009.
  • [46] M. Queiroz, F. Iazzetta, F. Kon, M. H. A. Gomes, F. L. Figueiredo, B. Masiero, L. K. Ueda, L. Dias, M. H. C. Torres, and L. F. Thomaz, “Acmus: An open, integrated platform for room acoustics research - journal of the brazilian computer society,” 2013.
  • [47] T. Houtgast and H. J. M. Steeneken, “The modulation transfer function in room acoustics as a predictor of speech intelligibility,” The Journal of the Acoustical Society of America, vol. 54, no. 2, pp. 557–557, 1973.
  • [48] H. J. M. Steeneken and T. Houtgast, “A physical method for measuring speech‐transmission quality,” The Journal of the Acoustical Society of America, vol. 67, no. 1, pp. 318–326, 1980.
  • [49] T. Houtgast and H. J. M. Steeneken, “A review of the MTF concept in room acoustics and its use for estimating speech intelligibility in auditoria,” The Journal of the Acoustical Society of America, vol. 77, pp. 1069–1077, 03 1985.
  • [50] O. Shih and A. Rowe, “Occupancy estimation using ultrasonic chirps,” in Proceedings of the ACM/IEEE Sixth International Conference on Cyber-Physical Systems, ICCPS ’15, (New York, NY, USA), pp. 149–158, Association for Computing Machinery, 2015.
  • [51] H. Qian, G. Zhenhao, and L. Chao, “Occupancy estimation in smart buildings using audio-processing techniques,” in International Conference on Computing in Civil and Building Engineering (ICCCBE) 2016, 2016 Fall.
  • [52] A. Ebadat, G. Bottegal, D. Varagnolo, B. Wahlberg, H. Hjalmarsson, and K. H. Johansson, “Blind identification strategies for room occupancy estimation,” in 2015 European Control Conference (ECC), pp. 1315–1320, 2015.
  • [53] M. West, “The sound attenuation in an open-plan office,” Applied Acoustics, vol. 6, no. 1, pp. 35–56, 1973.
  • [54] P. Somervuo, P. Lauha, and T. Lokki, “Effects of landscape and distance in automatic audio based bird species identification,” The Journal of the Acoustical Society of America, vol. 154, pp. 245–254, 07 2023.
  • [55] R. Badeau, “Common mathematical framework for stochastic reverberation models,” The Journal of the Acoustical Society of America, vol. 145, pp. 2733–2745, 04 2019.
  • [56] M. R. Schroeder and K. H. Kuttruff, “On Frequency Response Curves in Rooms. Comparison of Experimental, Theoretical, and Monte Carlo Results for the Average Frequency Spacing between Maxima,” The Journal of the Acoustical Society of America, vol. 34, pp. 76–80, 01 1962.
  • [57] J.-D. Polack, “Playing billiards in the concert hall: The mathematical foundations of geometrical room acoustics,” Applied Acoustics, vol. 38, no. 2, pp. 235–244, 1993.
  • [58] M. R. Schroeder, “Modulation transfer functions: Definition and measurement,” Acta Acustica united with Acustica, vol. 49, no. 3, pp. 179–182, 1981.
  • [59] K. Prawda, S. J. Schlecht, and V. Välimäki, “Calibrating the Sabine and Eyring formulas,” The Journal of the Acoustical Society of America, vol. 152, pp. 1158–1169, 08 2022.
  • [60] G. Götz, S. J. Schlecht, and V. Pulkki, “A dataset of higher-order ambisonic room impulse responses and 3d models measured in a room with varying furniture,” in 2021 Immersive and 3D Audio: from Architecture to Automotive (I3DA), pp. 1–8, 2021.
  • [61] I. Szöke, M. Skácel, L. Mošner, J. Paliesek, and J. Černocký, “Building and evaluation of a real room impulse response dataset,” IEEE Journal of Selected Topics in Signal Processing, vol. 13, no. 4, pp. 863–876, 2019.
  • [62] D. T. Murphy and S. Shelley, “Openair: An interactive auralization web resource and database,” in Audio Engineering Society Convention 129, Nov 2010.
  • [63] J. Thiemann, N. Ito, and E. Vincent, “The Diverse Environments Multi-channel Acoustic Noise Database (DEMAND): A database of multichannel environmental noise recordings,” Proceedings of Meetings on Acoustics, vol. 19, p. 035081, 05 2013.
  • [64] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: An asr corpus based on public domain audio books,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210, 2015.
  • [65] P. Srivastava, A. Deleforge, and E. Vincent, “Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators,” in 2022 International Workshop on Acoustic Signal Enhancement (IWAENC), pp. 1–5, 2022.
  • [66] A. Gulati, C.-C. Chiu, J. Qin, J. Yu, N. Parmar, R. Pang, S. Wang, W. Han, Y. Wu, Y. Zhang, and Z. Zhang, “Conformer: Convolution-augmented transformer for speech recognition,” 2020.
  • [67] P. Ramachandran, B. Zoph, and Q. V. Le, “Searching for activation functions,” in 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Workshop Track Proceedings, OpenReview.net, 2018.
  • [68] Y. Sun, L. Dong, B. Patra, S. Ma, S. Huang, A. Benhaim, V. Chaudhary, X. Song, and F. Wei, “A length-extrapolatable transformer,” in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (A. Rogers, J. Boyd-Graber, and N. Okazaki, eds.), (Toronto, Canada), pp. 14590–14604, Association for Computational Linguistics, 2023.
  • [69] Y. N. Dauphin, A. Fan, M. Auli, and D. Grangier, “Language modeling with gated convolutional networks,” in Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML’17, pp. 933–941, JMLR.org, 2017.
  • [70] J. Su, M. Ahmed, Y. Lu, S. Pan, W. Bo, and Y. Liu, “Roformer: Enhanced transformer with rotary position embedding,” Neurocomputing, vol. 568, p. 127063, 2024.
  • [71] D. Hendrycks and K. Gimpel, “Gaussian error linear units (gelus),” 2023.
  • [72] P. J. Huber, “A robust version of the probability ratio test,” Annals of Mathematical Statistics, vol. 36, pp. 1753–1758, 1965.
  • [73] L. Ciampiconi, A. Elwood, M. Leonardi, A. Mohamed, and A. Rozza, “A survey and taxonomy of loss functions in machine learning,” 2023.
  • [74] L. Liu, H. Jiang, P. He, W. Chen, X. Liu, J. Gao, and J. Han, “On the variance of the adaptive learning rate and beyond,” 2021.
  • [75] I.-J. Jung and J.-G. Ih, “Distance estimation of a sound source using the multiple intensity vectors,” The Journal of the Acoustical Society of America, vol. 148, pp. EL105–EL111, 07 2020.