BERP: A Blind Estimator of Room Acoustic and Physical Parameters for Single-Channel Noisy Speech Signals

Lijun Wang 1, Yixian Lu 1, Ziyan Gao, Kai Li, Jianqiang Huang, Yuntao Kong, and Shogo Okada

L. Wang, Y. Lu, Z. Gao, K. Li, J. Huang, Y. Kong, and S. Okada are with the School of Information Science, Japan Advanced Institute of Science and Technology, Ishikawa, Japan. email: {lijun.wang, ziyan-g, kai-li, jq.huang, okada-s, yuntao.kong}@jaist.ac.jp (Corresponding author: [email protected]). Y. Lu is with ACES, Inc., Tokyo, Japan. email: [email protected]. This work was partially supported by the Japan Society for the Promotion of Science (JSPS) KAKENHI (grant number 23H03506). 1 Equal contribution. The code and weights are available at https://github.com/Alizeded/BERP.

Abstract

Room acoustic parameters (RAPs) and room physical parameters (RPPs) are essential metrics for parameterizing the room acoustical characteristics (RACs) of the sound field in a listener’s local environment, offering comprehensive indications for various applications. Current RAP and RPP estimation methods either fall short of covering broad real-world acoustic environments with real background noise or lack a universal framework for blindly estimating RAPs and RPPs from noisy single-channel speech signals, particularly sound source distances, the direction-of-arrival (DOA) of sound sources, and occupancy levels. In this paper, we propose a novel universal blind estimation framework called the blind estimator of room acoustic and physical parameters (BERP). We introduce a new stochastic room impulse response (RIR) model, namely, the sparse stochastic impulse response (SSIR) model, and endow the BERP with a unified encoder and multiple separate predictors to estimate the RPPs and SSIR parameters in parallel. This framework enables computationally efficient and universal estimation of room parameters using only noisy single-channel speech signals. Finally, all the RAPs can be simultaneously derived from the RIRs synthesized by the SSIR model with the estimated parameters. To evaluate the effectiveness of the proposed BERP and SSIR models, we compile a task-specific dataset from several publicly available datasets. The results reveal that the BERP achieves state-of-the-art (SOTA) performance. Moreover, the evaluation results pertaining to the SSIR RIR model also demonstrate its efficacy. The code is available on GitHub.

Index Terms:

Room acoustics, Room impulse response, Blind estimation, Reverberation time, Room acoustic parameters, Attention mechanism

I Introduction

Room acoustical characteristics (RACs) describe the acoustical properties through which people perceive sound in an enclosure. RACs determine how intelligibly and clearly people perceive sound in an auditory space enclosed by walls, ceilings, and furnishings. For instance, concert halls require clear and transparent sound for music appreciation, whereas lecture rooms require intelligible delivery for lectures and public addresses. General auditoriums require intelligible and easily audible sound [1]. Local RACs are widely employed in speech enhancement, hearing aids, immersive audio, context-aware rendering (such as mixed reality and augmented reality), public address systems, and robotic systems. The dynamic parameterization of local RACs poses a significant challenge in room acoustics, given the interference caused by background environmental noise.

While a room impulse response (RIR) can fully represent a listener’s local RACs, it does not provide a direct interpretation of how humans perceive their local RACs, i.e., the subjective perception of the local RACs. Since speech intelligibility and sound clarity are subjective perceptions, listening experiments are typically conducted to assess them. However, conducting listening experiments is expensive and time-consuming, making them impractical to apply in public spaces [2]. Additionally, an RIR does not directly provide physical geometry-related information, such as room volumes, the distances of sound sources, and the corresponding orientations, which have critical applications in spatial audio rendering, intelligibility assessments in a room, sound source separation, audio navigation systems, and speech enhancement [3, 4, 5, 6, 7, 8, 9]. Consequently, room acoustic parameters (RAPs) and room physical parameters (RPPs) have been used to model local RACs to offer clear and comprehensive indications for various applications, such as room acoustical assessment [10, 11, 12, 7, 13], speech enhancement [14, 15, 9], hearing aids [16, 17, 18, 19, 6, 20], sound source separation [3, 4], spatial audio rendering [21], context-aware rendering in extended reality (XR) and augmented reality (AR) [22, 5, 23], public address systems [24, 25], and robotics [26].

A few RAPs have been investigated and standardized [27, 11, 10, 12]. In IEC 60268-16:2020, the speech transmission index (STI) is used to predict the speech intelligibility of an enclosure. The percentage articulation loss of consonants ( $\%\rm{AL}_{\rm{cons}}$ ) [11] was studied to compensate for the limitations of the STI, which has difficulty reflecting the effect of linguistic information on the perception of intelligibility. The essential RAPs and their corresponding measurements, including reverberation time ( $T_{60}$ ), early decay time (EDT), clarity ( $C_{80}$ / $C_{50}$ ), definition ( $D_{50}$ ), and center time ( $T_{s}$ ), have been standardized in ISO 3382-1:2009 [12]. $T_{60}$ is the most essential RAP for representing the RACs of an enclosure. RAPs can be directly derived from the measured RIR. However, RIR measurement requires high-energy sound, so the people located in an enclosure must be excluded, which is impractical for public spaces [28]. Furthermore, RIR measurement is limited in its ability to capture the dynamics of the local RACs, which vary according to the locations, arrangements, and quantities of the objects and occupants that are present. The RAPs measured using specific standards may differ from noncompliant measurements employed within the same enclosure. Therefore, a blind RAP estimation method is imperative, particularly in public spaces where people cannot be excluded. Several RPPs have been studied [27, 5, 6, 29, 30], such as the room volume, the sound source distance, and the direction-of-arrival (DOA) of the sound source. The room volume is closely related to the RACs [27, 5, 6]. It may be derived from the measured RIR, but this process encounters the aforementioned issues. Moreover, the sound source distance and DOA are observer-dependent parameters. As a result, blind estimation methods have been proposed to obtain the RAPs and RPPs from observed signals.
Blind estimation is a challenging task since it is an ill-posed problem that derives a system solely from an output without prior knowledge of the input.

Deep learning techniques are well-suited for constructing complex mappings between high-dimensional data acquired from messy realistic environments, often without explicit indications of relevance [31]. Hence, the common approach to blind estimation is to establish a mapping from the observed signals to the output using deep learning techniques.

In terms of blind RAP estimation in scenarios with background noise, deep learning techniques are currently at the forefront of this field. Several methods utilizing fully convolutional neural networks (CNNs) have achieved blind reverberation time and room volume estimation from the Gammatonegrams of single-channel noisy speech signals [32, 33] by leveraging the network architecture initially developed by Gamper and Tashev [34]. Furthermore, López et al. [35] and Callens et al. [36] introduced the convolutional recurrent neural network (CRNN) architecture, which performed best in the ACE challenge [37], for universally estimating the reverberation time, clarity, and direct-to-reverberation ratio from the mel frequency cepstral coefficients (MFCCs) of single-channel noisy speech signals. Zheng et al. [38] proposed a CNN method with a gating mechanism that was designed for reverberation time estimation in noisy conditions using the spectrogram of the observed signal. Duangpummet et al. [39] developed a TAE-CNN architecture using the temporal amplitude envelope (TAE) of the observed signal, enabling the concurrent estimation of STI, reverberation time, clarity, definition, and center time. For the blind estimation of RPPs, a fully convolutional architecture was employed to estimate room volumes from single-channel speech signals in [32, 33]. Additionally, the CRNN architecture was deployed to estimate sound source distances and the DOAs of sound sources from multi-channel speech signals [29].

The current methods, however, either fall short of covering a sufficiently broad RIR range to accommodate real-world scenarios or rely on widely used image-source-based synthetic RIRs. It is difficult for these synthetic RIRs to accurately reflect complex real-world room geometries, and they do not emulate the natural decay properties of realistic RIRs, which affects human perception [40]. Given that RAPs are employed to objectively assess human perceptions of RACs, the use of synthetic RIRs may introduce biases in perception evaluation [41]. In addition, it is more efficient to train, with limited data, a universal architecture that is capable of simultaneously estimating RAPs and RPPs in a unified methodology, especially for instantaneous occupancy levels. Furthermore, to the best of our knowledge, no learning-based schemes are available for the blind estimation of sound source distances and DOAs from single-channel speech signals. Since it has been reported that single-channel acoustic cues can be used to estimate DOAs [42, 43], it should also be possible to estimate sound source distances from the same cues by using deep learning techniques.

These gaps motivate us to propose a new method, a blind estimator of room acoustic and physical parameters (BERP), that can blindly estimate room parameters universally in various real-world acoustic environments with background environmental noise. We integrate a sparse stochastic impulse response (SSIR) model, a new stochastic RIR model, into the process of mapping the observed speech signals to the desired RAPs. This RIR model fuses the distinct statistical properties, i.e., the sparse and dense statistical properties, of different segments of realistic RIRs to model realistic RIRs more accurately. The SSIR model helps derive all RAPs simultaneously without introducing additional complexity to the trainable model, since only the mapping between the observed signals and the parameters of the SSIR model needs to be established. In contrast, we directly establish the relationship between the RPPs and the observed signals by using neural networks.

Our work makes three important contributions to the current knowledge frontier, as follows:

• A new stochastic RIR model is proposed to effectively model realistic RIRs for simultaneous RAP derivation.

• Signal models for the speech signals observed at a listener’s location, in particular for occupancy level estimation, and the corresponding data synthesis pipelines are proposed.

• A new universal blind estimation framework for estimating RAPs and RPPs in parallel is proposed, which achieves state-of-the-art (SOTA) performance.

The rest of the paper is organized as follows: Section II briefly introduces the RAPs and RPPs. The proposed method is introduced in Section III, and the corresponding experimental settings are outlined in Section IV. We discuss and conclude in Section V.

II Room Parameters

II-A Room Acoustic Parameters

Several RAPs that describe the RACs of an auditory space have been investigated and standardized [44, 27, 45, 46, 41, 10, 12]. The parameters that audio engineers widely use to parameterize room acoustics are briefly introduced below.

II-A1 Intelligibility Parameters

Intelligibility parameters, including the STI and $\%\rm{AL}_{\rm{cons}}$ , are used to predict speech intelligibility and assess verbal comprehension in a sound field.

Speech transmission index. The STI is employed to predict speech intelligibility and the corresponding listening difficulty in noisy surroundings. Houtgast and Steeneken initially defined the STI based on the modulation transfer function (MTF) [47, 48]. The higher the STI is, the more intelligible a sound field is. The STI can be calculated from the RIR as follows, as standardized in IEC 60268-16:2020 [10].

First, an RIR $h(t)$ passes through seven octave-band filters to obtain the MTFs at 14 specific modulation frequencies from the corresponding temporal envelopes as:

m_{k}(f_{m,i})=\frac{\int_{0}^{\infty}h_{k,\rm{oct}}^{2}(t)\exp(-j2\pi f_{m,i}t)dt}{\int_{0}^{\infty}h_{k,\rm{oct}}^{2}(t)dt},

(1)

where $k=1,...,7$ indexes the octave bands, $i=1,...,14$ indexes the modulation frequencies, $h_{k,\rm{oct}}$ denotes the RIR in the $k$ -th octave band, and $m_{k}(f_{m,i})$ represents the MTF for the $k$ -th octave band at the $i$ -th modulation frequency. The 14 modulation frequencies are $f_{m,i}(\rm{Hz})=\{0.63,\ 0.80,\ 1.00,\ 1.25,\ 1.60,\ 2.00,\ 2.50,\ 3.15,\ 4.00,\ 5.00,\ 6.30,\ 8.00,\ 10.00,\ 12.50\}$ . Then, the modulation distortion ratio is calculated as follows:

N_{k,i}=10\log_{10}\bigg{[}\frac{m_{k}(f_{m,i})}{1-m_{k}(f_{m,i})}\bigg{]}.

(2)

The transmission index at each octave band is normalized to the unit scale by limiting the range of $N_{k,i}$ relative to $15$ dB, which is determined as:

T(k,i)=\begin{cases}1,&N_{k,i}>15,\\ \frac{N_{k,i}+15}{30},&-15\leq N_{k,i}\leq 15,\\ 0,&N_{k,i}<-15.\end{cases}

(3)

Finally, the STI is calculated as the weighted sum of $T(k,i)$ :

\rm{STI}=\sum_{k=1}^{7}w_{k}\Big{[}\frac{1}{14}\sum_{i=1}^{14}T(k,i)\Big{]},

(4)

where $\rm{w}=\{0.129,\ 0.143,\ 0.114,\ 0.114,\ 0.186,\ 0.171,\ 0.143\}$ .

Percentage articulation loss of consonants. $\%\rm{AL}_{\rm{cons}}$ measures the proportion of incorrectly understood consonants, and this measure was originally introduced by Peutz and Klein [11]. Since the STI does not account for the way in which a listener’s proficiency and linguistic knowledge affect intelligibility, $\%\rm{AL}_{\rm{cons}}$ builds on the assumption that consonants are harder to comprehend than vowels in a room. $\%\rm{AL}_{\rm{cons}}$ thereby addresses the limitations of the STI by not discounting significant intelligibility-related information and by considering linguistic proficiency. Moreover, its robustness against contamination from guessing makes it a strong indicator of speech intelligibility [6]. Thus, $\%\rm{AL}_{\rm{cons}}$ is an indispensable complement to the STI for comprehensively evaluating speech intelligibility within various room settings. $\%\rm{AL}_{\rm{cons}}$ can be calculated from the STI according to Farrell Becker’s empirical formula [49]:

\%\rm{AL}_{\rm{cons}}=170.5045\cdot e^{-5.419\cdot\rm{STI}}.

(5)
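As an illustration, Eqs. (1) to (5) can be sketched in a few lines of NumPy. This is a minimal sketch rather than the authors' implementation: the function names are hypothetical, the octave-band filtering that produces each band-limited RIR is assumed to have been done beforehand, and the magnitude of the Fourier integral is taken so that the MTF is real-valued (Eq. (1) as printed does not show the magnitude).

```python
import numpy as np

# Modulation frequencies (Hz) and octave-band weights as listed in the text.
F_MOD = np.array([0.63, 0.80, 1.00, 1.25, 1.60, 2.00, 2.50,
                  3.15, 4.00, 5.00, 6.30, 8.00, 10.00, 12.50])
W_OCT = np.array([0.129, 0.143, 0.114, 0.114, 0.186, 0.171, 0.143])


def mtf(h_band, fs):
    """Eq. (1) for one octave-band RIR, evaluated at the 14 modulation frequencies.

    Assumption: the magnitude is taken so that the result is real-valued.
    """
    t = np.arange(len(h_band)) / fs
    energy = h_band ** 2
    numerator = np.abs(np.exp(-2j * np.pi * np.outer(F_MOD, t)) @ energy)
    return numerator / energy.sum()


def sti_from_mtf(m):
    """Eqs. (2)-(4). `m` has shape (7, 14): octave bands x modulation frequencies."""
    m = np.clip(m, 1e-6, 1.0 - 1e-6)              # keep the logarithm finite
    n = 10.0 * np.log10(m / (1.0 - m))            # Eq. (2)
    ti = np.clip((n + 15.0) / 30.0, 0.0, 1.0)     # Eq. (3)
    return float(W_OCT @ ti.mean(axis=1))         # Eq. (4)


def al_cons_from_sti(sti):
    """Eq. (5): empirical conversion from the STI to %ALcons."""
    return 170.5045 * np.exp(-5.419 * sti)
```

Because the seven band weights sum to one, an MTF of 0.5 in every band and at every modulation frequency gives a modulation distortion ratio of 0 dB and therefore an STI of 0.5.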

II-A2 Reverberation Parameters

The reverberation time ( $T_{60}$ ) and early decay time (EDT) are pertinent to reverberation and quantify the subjective impression of the vivacity of a sound field. $T_{60}$ is the most essential RAP since it characterizes the physical properties of the RACs through the time over which the reverberant energy decays by $60$ dB. The EDT represents the decay time for the initial $10$ dB of decay, emphasizing the more important contribution of early reflections to the perceived reverberation. $T_{60}$ is the 60-dB decay time calculated by fitting a line to the portion of the energy decay curve (EDC) of the RIR from $-5$ dB to $-35$ dB and linearly extrapolating it to $-60$ dB. Similarly, the EDT is obtained by fitting a line to the initial portion of the EDC, from $0$ dB to $-10$ dB, and extrapolating it to $-60$ dB.
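The line-fitting procedure can be sketched as follows, assuming the EDC is obtained by backward (Schroeder) integration of the squared RIR, which is the common way to compute it. The function names are hypothetical, and the sketch omits practical steps such as noise-floor truncation.

```python
import numpy as np


def energy_decay_curve_db(h):
    """Backward-integrated energy decay curve, normalized to 0 dB at t = 0."""
    energy = np.cumsum(h[::-1] ** 2)[::-1]
    return 10.0 * np.log10(energy / energy[0])


def decay_time(h, fs, start_db, stop_db):
    """Fit a line to the EDC between start_db and stop_db and extrapolate to -60 dB."""
    edc = energy_decay_curve_db(h)
    t = np.arange(len(h)) / fs
    mask = (edc <= start_db) & (edc >= stop_db)
    slope, _ = np.polyfit(t[mask], edc[mask], 1)   # slope in dB per second
    return -60.0 / slope


def t60(h, fs):
    return decay_time(h, fs, -5.0, -35.0)


def edt(h, fs):
    return decay_time(h, fs, 0.0, -10.0)
```

For an ideal exponentially decaying envelope such as `np.exp(-6.9 * t / 0.5)`, both functions return approximately 0.5 s, since the decay is linear in dB over the whole curve.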

II-A3 Energy Parameters

Clarity ( $C_{50}$ and $C_{80}$ ), definition ( $D_{50}$ ), and center time ( $T_{s}$ ) are the energy parameters that quantify the balance between the energy contributed by early reflections and that contributed by late reverberation in the RIR. They are strongly related to the impression of transparency.

Clarity. $C_{50}$ and $C_{80}$ express the logarithmic ratio of the energy within the first $50$ ms (for speech) or the first $80$ ms (for music) to the energy of the remaining RIR, thereby characterizing the perceived clarity of a speech or music signal propagating within a room. Clarity can be defined as:

C_{t_{e}}=10\log_{10}\Bigg{(}\frac{\int_{0}^{t_{e}}h^{2}(t)dt}{\int_{t_{e}}^{\infty}h^{2}(t)dt}\Bigg{)},

(6)

where $t_{e}$ denotes $50$ or $80$ ms, respectively.

Definition. $D_{50}$ indicates the subjective intelligibility of speech in a room, which is defined as the ratio of the energy received within $50$ ms to the total energy of the RIR.

D_{50}=\frac{\int_{0}^{50\ \rm{ms}}h^{2}(t)dt}{\int_{0}^{\infty}h^{2}(t)dt}\times 100.

(7)

Center time. $T_{s}$ refers to “the center of gravity time”, characterizing the balance between clarity and reverberation that is related to speech intelligibility. $T_{s}$ is given by:

T_{s}=\frac{\int_{0}^{\infty}th^{2}(t)dt}{\int_{0}^{\infty}h^{2}(t)dt}.

(8)
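Eqs. (6) to (8) translate directly into discrete sums over the sampled RIR. The following is a minimal sketch with hypothetical function names; it assumes that the RIR array starts at the arrival of the direct sound, so that sample 0 corresponds to $t=0$.

```python
import numpy as np


def clarity(h, fs, te_ms):
    """Eq. (6): early-to-late energy ratio in dB, with te_ms = 50 or 80."""
    n = int(round(te_ms * 1e-3 * fs))
    energy = h ** 2
    return 10.0 * np.log10(energy[:n].sum() / energy[n:].sum())


def definition_d50(h, fs):
    """Eq. (7): percentage of the total energy arriving within the first 50 ms."""
    n = int(round(0.050 * fs))
    energy = h ** 2
    return 100.0 * energy[:n].sum() / energy.sum()


def center_time(h, fs):
    """Eq. (8): energy-weighted mean arrival time in seconds."""
    energy = h ** 2
    t = np.arange(len(h)) / fs
    return (t * energy).sum() / energy.sum()
```

The two ratios are linked: with $d=D_{50}/100$, Eqs. (6) and (7) imply $C_{50}=10\log_{10}\big(d/(1-d)\big)$, which offers a quick consistency check on any implementation.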

Figure 1: Overview of the architecture of the blind estimator of room acoustic and physical parameters (BERP). The inputs are the speech signals observed within a room, i.e., the noisy reverberant and crowded reverberant speech signals, while the outputs are the estimated RAPs and RPPs detailed in Sections II-A and II-B, respectively. The architecture can adapt to various input lengths without the need for length alignment. Figs. 7-9 show the detailed architectures of the room feature encoder, Fig. 10 shows the parametric predictor, and Fig. 12 shows the architecture of the acoustical bias corrector.

II-B Room Physical Parameters

RPPs are parameters related to the physical characteristics of a room. These parameters encompass the geometric room volume, sound source distance, DOA of the sound source, and instantaneous occupancy level around the listener’s location.

II-B1 Geometric Room Volume

The geometric room volume $V$ is a position-independent parameter for modeling the attributes of a room. $V$ is strongly related to the estimation of the critical distance ( $D_{c}$ ), which is the distance from the sound source at which the energy density of the reverberant signal is equal to that of the direct signal [27]. $D_{c}$ can be approximated using Sabine’s formula:

D_{c}=\sqrt{\frac{\varrho A}{16\pi}}\approx 0.1\sqrt{\frac{\varrho V}{\pi T_{60}}},

(9)

where $\varrho$ signifies the source directivity factor, and $A$ represents the equivalent absorption area of a room. $D_{c}$ is vital for determining whether a virtual sound source should be rendered with reverberation, thereby serving as a key distance cue for the perception of reverberation by the listener [27, 33].

Furthermore, the mixing time used in AR rendering applications [5] can be determined from $V$ as $t_{m}=\sqrt{V}$ . Jot et al. [5] identified room volume as the reverberation fingerprint to characterize rooms for spatial AR rendering. $V$ also plays an important role in the speech intelligibility [6]. The critical distance of intelligibility ( $D_{ci}$ ), which acts as a distance cue for perceived intelligibility, is derived from $V$ as:

D_{ci}=0.2\sqrt{\frac{\varrho V}{T_{60}}}.

(10)

$\%\rm{AL}_{\rm{cons}}$ also exhibits a strong relationship with $V$ , which can be alternatively expressed as [6]:

{\%\rm{AL}_{\rm{cons}}}=\frac{200D^{2}T^{2}_{60}}{\varrho V}+\mathfrak{c}.

(11)

$D$ is the sound source distance, and $\mathfrak{c}$ is the correction factor.
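The volume-dependent quantities in Eqs. (9) to (11) are simple closed-form expressions, sketched below with hypothetical function names. SI units are assumed (volume in cubic meters, $T_{60}$ in seconds, distances in meters), and the default directivity factor of 1 and correction factor of 0 are illustrative placeholders rather than values from the text.

```python
import math


def critical_distance(volume, t60, directivity=1.0):
    """Eq. (9), right-hand approximation: D_c = 0.1 * sqrt(rho * V / (pi * T60))."""
    return 0.1 * math.sqrt(directivity * volume / (math.pi * t60))


def critical_distance_of_intelligibility(volume, t60, directivity=1.0):
    """Eq. (10): D_ci = 0.2 * sqrt(rho * V / T60)."""
    return 0.2 * math.sqrt(directivity * volume / t60)


def al_cons_from_geometry(distance, volume, t60, directivity=1.0, correction=0.0):
    """Eq. (11): %ALcons = 200 * D^2 * T60^2 / (rho * V) + c."""
    return 200.0 * distance ** 2 * t60 ** 2 / (directivity * volume) + correction
```

For example, for a room with $V=200$ and $T_{60}=0.5$, the sketch gives $D_{c}\approx 1.13$ and $D_{ci}=4$, and a source at $D=2$ yields a value of 1 for the first term of Eq. (11).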

II-B2 Sound Source Distance

The sound source distance $D$ contributes significantly to complementing the sound source localization (SSL) by integrating it with the DOA of the sound source [29]. The SSL is widely used in applications such as sound source separation [3], audio-oriented and navigational systems [8], speech-related applications [30], and human-robot interaction [26]. Furthermore, $D$ is intimately related to the perception of speech intelligibility, particularly in terms of $\%\rm{AL}_{\rm{cons}}$ , as elaborated in Eq. (11).

II-B3 Direction-of-Arrival of the Sound Source

As mentioned in Section II-B2, the DOA is a crucial component of SSL [30], and it has several applications in sound source separation [4], speech recognition [20], speech enhancement [9], and room acoustical analysis [7]. In this work, the DOA is represented by a pair including an azimuth ( $\theta$ ) and elevation ( $\psi$ ) and is denoted as $\rm{DOA}\coloneqq\{\theta,\psi\}$ .

II-B4 Instantaneous Occupancy Level

The detection of the instantaneous occupancy level $N$ of a room around a listener’s location is highly useful for several applications. It is commonly known that the number of occupants affects the reverberation [50], thus affecting the efficacy of demand-driven hearing aid systems and speech enhancement methods. Additionally, the interfering speech generated by the occupants around the listener affects the target signals that the listener intends to receive. Knowing the occupancy level helps control interference to achieve intelligible and clear transmission.

In the context of smart homes, the estimated number of occupants can optimize the control of demand-driven heating, ventilation, and air conditioning (HVAC) operations in the local space to significantly reduce the cost of building operations for sustainable smart buildings [51, 52]. In XR and AR scenarios, the local occupancy level, as an important piece of environmental information, is fundamental for ensuring safe interaction in real-world scenes, especially in public spaces populated by others.

III Proposed Method

Overview: BERP. Fig. 1 shows the signal flow process within the proposed BERP framework. The input waveform is converted into a spectrogram-variant feature representation, which is subsequently fed into the room feature encoder (RFE). Finally, parametric predictors (PP) and a fully-connected (FC) layer output room parameters based on different estimation tasks for different real-world scenarios. When estimating RAPs and RPPs, except the occupancy level, noisy reverberant speech signals serve as the observed signal inputs for the featurizer. In contrast, when estimating the instantaneous occupancy level $N$ , the crowded reverberant speech signals are the inputs of the featurizer.

III-A Signal Models

III-A1 Noisy Reverberant Signal Model

The noisy reverberant signal observed by a listener, which is transmitted from a speaker within a room and subject to the influence of the background environmental noise, can be formulated as:

y_{\rm{nr}}(t)=x(t)*h(t)+n(t)

(12)

where $y_{\rm{nr}}(t)$ denotes the noisy reverberant signal as perceived by the listener, $x(t)$ denotes the source speech signal, $h(t)$ denotes the RIR, and $n(t)$ represents the background noise that is prevalent in the listener’s local surroundings. The symbol $*$ denotes the convolution operation.

The $y_{\rm{nr}}$ encapsulates the RIR information that fully characterizes the RACs in the listener’s local space, including RAPs and room volume. In addition, it contains information pertaining to sound source localization, such as the distance, azimuth, and elevation. Therefore, this signal model is instrumental for parameterizing the listener’s local acoustic space, encompassing aspects such as the volume, the distance and DOAs of the sound source, and RAPs. The noisy reverberant signal model is employed to model the real-world scenarios in which a listener interacts with a single speaker in the presence of environmental noise, as illustrated in Fig. 2.

III-A2 Crowded Reverberant Signal Model

Currently, the research domain lacks a comprehensive reverberant speech corpus for crowded environments that enables the estimation of the occupancy level around the listener and includes complete meta-information, namely the number of speakers, the spatial geometry of the speaker distribution relative to the listener, and the local RACs of the space that the listener occupies. We introduce a novel signal model that incorporates this detailed meta-information to address this gap, as shown in Fig. 2.

This signal model can be expressed as:

y_{\rm{cr}}(t)=\sum_{i=1}^{N}\Big{[}\frac{d_{0}}{d_{i}}A_{0}x_{i}(t)*h(t)\Big{]}=\Big{[}\sum_{i=1}^{N}\frac{d_{0}}{d_{i}}A_{0}x_{i}(t)\Big{]}*h(t),

(13)

where $y_{\rm{cr}}$ signifies the crowded reverberant speech signal, $x_{i}(t)$ represents the speech signal originating from the $i$ -th speaker proximal to the listener, and $d_{i}$ denotes the distance between the $i$ -th speaker and the listener, which follows a Gaussian distribution. $A_{0}$ represents the baseline amplitude observed at a distance of $d_{0}$ from the listener, and $h(t)$ denotes the RIR that delineates the acoustic characteristics of the local room. Here, $d_{0}$ is assumed to be equal to 1. $N$ represents the total count of speakers, i.e., the occupancy level, and follows a gamma distribution, which is well-suited for modeling real-life events that yield only positive results.

When developing this speech signal model, a set of fundamental assumptions is postulated. These assumptions are instrumental for enabling an approximation that closely mirrors real-world scenarios while effectively mitigating the intricacies embedded within the observed speech signal, thereby devising a theoretically sound and practically viable model.

Assumption 1.

We hypothesize that the maximum spatial extent surrounding the listener is approximately 6 meters, which is grounded in the fact that speech signals originating from the occupants near the listener undergo an attenuation of approximately 35 dB, rendering them nearly imperceptible as distinguishable speech and essentially inaudible [53, 54]. Thus, crowded speech signals radiating from beyond this 6-meter threshold are considered background environmental noise.

Assumption 2.

The model assumes that the upper limit imposed on the number of speakers near the listener, i.e., $N$ , is restricted to $12$ . This premise is substantiated by empirical findings, which suggest that excessively overlapping concurrent speech signals tend to amalgamate into singular background noise, consequently diminishing their individual discernibility as separate speech elements.

Assumption 3.

In everyday settings, particularly within a confined small area such as an area with a radius of 6 meters, it is more common for a listener to engage with approximately 3 to 4 occupants. Hence, it is postulated that within a zone bounded by a 6-meter radius, the listener encounters an average of $4$ occupant speakers.
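Eq. (13) and the three assumptions above can be sketched as follows. This is an illustrative sketch only: the function names, the gamma shape parameter, the Gaussian mean and standard deviation of the distances, and the minimum distance are all hypothetical choices, since the text specifies only the distribution families, the 6-meter radius, the upper limit of 12 speakers, and the mean of 4 speakers.

```python
import numpy as np


def sample_scene(rng, mean_occupancy=4.0, max_occupancy=12, max_radius=6.0):
    """Draw an occupancy level N and per-speaker distances d_i.

    Hypothetical parameters: gamma shape 4.0, distance mean 3.0 m and std 1.0 m,
    and a 0.5 m minimum distance are illustrative, not values from the paper.
    """
    shape = 4.0
    n = rng.gamma(shape=shape, scale=mean_occupancy / shape)   # mean = mean_occupancy
    n = int(np.clip(np.round(n), 1, max_occupancy))            # Assumption 2
    d = rng.normal(loc=3.0, scale=1.0, size=n)
    d = np.clip(d, 0.5, max_radius)                            # Assumption 1
    return n, d


def crowded_reverberant(sources, distances, h, a0=1.0, d0=1.0):
    """Right-hand side of Eq. (13): mix the attenuated sources, then convolve once."""
    mix = np.zeros(max(len(x) for x in sources))
    for x, d in zip(sources, distances):
        mix[: len(x)] += (d0 / d) * a0 * x
    return np.convolve(mix, h)
```

Mixing before convolving uses the linearity of convolution expressed by the second equality in Eq. (13), so a single convolution with $h(t)$ suffices regardless of $N$.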

III-A3 Sparse Stochastic Impulse Response Model

Within the scope of dynamic blind RAPs and RPPs estimation, our access is confined to an observed noisy reverberant signal. Hence, we model the observed signal as in Eq. (12). The ill-posed nature of blind estimation necessitates an RIR model to approximate an unknown RIR for serving as a bridge between the sound source signal and the perceived noisy reverberant signal.

Moreover, to reduce the computational complexity and to facilitate their simultaneous estimation, it is more efficient to model the RIR and subsequently estimate the parameters of this RIR model. This approach enables the simultaneous derivation of the RAPs from the modeled RIR instead of directly estimating them from the noisy reverberant signal.

The RIR can be divided into isolated room modes (early reflections) and dense room modes (late reverberation) by applying modal theory to the room frequency response [55], i.e., the Fourier transform of the RIR, and using Schroeder’s frequency [56].

Badeau [55] introduced a unified mathematical framework for stochastically modeling the RIR, according to the image-source principle [27], as shown in Fig. 3. This work reported that the image sources (i.e., reflections) are distributed according to a uniform Poisson distribution. The author further demonstrated that this stochastic distribution of image sources remains invariant regardless of the locations of the sound source and the receiver. Alternatively, based on billiard theory, Polack [57] showed that the Poisson distribution of image sources is also independent of the room geometry. Additionally, Traer and McDermott [40] analyzed the RIR statistics. They found that, during the time interval of dense room modes, the RIR exhibits a Gaussian distribution; this was in stark contrast to the time interval of isolated room modes, which exhibited a non-Gaussian distribution.

Drawing inspiration from [57, 40, 55], we present a novel stochastic RIR model, namely, the sparse stochastic impulse response (SSIR) model. This model combines the different stochastic properties of the isolated and dense room modes of the RIR. Specifically, the time interval of the isolated room modes is dominated by uniform Poisson-distributed image sources with their sparsity proportional to the room volume as $h_{\rm{i}}(t)\sim\boldsymbol{P}(\lambda\lvert V\rvert)$ . Conversely, the time interval related to the dense room modes presents a Gaussian distribution as $h_{\rm{d}}(t)\sim\boldsymbol{N}(0,1)$ . Here, $h_{\rm{i}}(t)$ and $h_{\rm{d}}(t)$ represent the isolated and dense room modes of the RIR, respectively, and $V$ denotes the room volume. Fig. 4 shows the fitting of the proposed SSIR model to a realistic RIR.

The SSIR model can be defined as:

h_{\rm{ssir}}(t)=\begin{cases}h_{\rm{i}}(t)=be^{\alpha t/T_{i}}\odot\boldsymbol{P}(\lambda\lvert V\rvert),&t\in[0,T_{i})\\ h_{\rm{d}}(t)=be^{-\alpha t/T_{d}}\odot\boldsymbol{N}(0,1),&t\in[T_{i},T_{d}]\end{cases}

(14)

\boldsymbol{P}(\lambda\lvert V\rvert)=\frac{\lambda^{V}\cdot e^{-\lambda}}{V!},

(15)

\boldsymbol{N}(0,1)=\frac{e^{-t^{2}/2}}{\sqrt{2\pi}},

(16)

where $T_{i}$ and $T_{d}$ are two parameters that control the exponentially ascending and descending temporal envelopes of the RIR, respectively. The constant $\alpha=6.9$ is known as Schroeder’s coefficient [58], and $\lambda$ , which is equal to $\mu$ , signifies the average of $T_{i}$ across the sample set. Here, $\mu$ is empirically determined to be $0.0399$ .
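One possible reading of Eq. (14) is sketched below; it should not be taken as the authors' implementation. The sketch interprets the two stochastic terms as random excitations (Poisson draws with rate $\lambda$ for the isolated segment and standard Gaussian draws for the dense segment) multiplied sample-wise by the exponential envelopes. The function name, the sampling rate default, and the treatment of $b$ as a plain amplitude scale are assumptions, and the dependence of the Poisson term on the room volume $V$ is deliberately left out because the text does not specify how it enters the sampling.

```python
import numpy as np

ALPHA = 6.9       # Schroeder's coefficient, as given in the text
LAMBDA = 0.0399   # empirical value of lambda reported in the text


def synthesize_ssir(t_i, t_d, fs=16000, b=1.0, rng=None):
    """Draw one RIR realization following the envelopes of Eq. (14).

    Assumptions: t_i < t_d (both in seconds), b is a hypothetical amplitude
    scale, and the room volume does not enter the Poisson draws.
    """
    rng = np.random.default_rng() if rng is None else rng
    n_i = int(round(t_i * fs))
    n_d = int(round(t_d * fs))
    t = np.arange(n_d) / fs
    h = np.empty(n_d)
    # Isolated room modes, t in [0, T_i): ascending envelope, sparse Poisson draws.
    h[:n_i] = b * np.exp(ALPHA * t[:n_i] / t_i) * rng.poisson(LAMBDA, n_i)
    # Dense room modes, t in [T_i, T_d]: descending envelope, Gaussian draws.
    h[n_i:] = b * np.exp(-ALPHA * t[n_i:] / t_d) * rng.standard_normal(n_d - n_i)
    return h
```

With $\lambda=0.0399$, roughly 4% of the samples in the isolated segment are nonzero, which reproduces the sparse character of early reflections, while the dense segment behaves as exponentially decaying Gaussian noise.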

III-B Datasets

A significant challenge encountered when using a data-driven method for blind room acoustical estimation lies in the quality and coverage of the collected data, which are crucial for ensuring satisfactory generalization. Therefore, it is essential to construct a dataset characterized by large scale, substantial diversity, and detailed annotations of RAPs and RPPs. We collect realistic RIRs encompassing an extensive range of rooms with different volumes and geometries, distinct sound source and receiver locations, and unique sound absorption coefficients of the room surfaces. Hence, the dataset covers a wide spectrum of broadband RAPs and RPPs. Furthermore, the dataset is augmented to refine the distribution of the annotations, thereby maximizing its diversity and representativeness.

III-B1 RIR Data Collection

We aggregate five extensive realistic RIR datasets to construct a composite RIR dataset representing a wide range of room geometries and RACs. These datasets are the Arni RIR dataset [59], the Motus dataset [60], the BUT ReverbDB [61], the ACE corpus of RIRs [37], and the OpenAIR dataset [62]. Each dataset comprises single-channel, omnidirectionally recorded RIRs. We resample all RIRs to $16$ kHz.

III-B2 Speech and Noise Data Collection

To replicate the background environmental noise encountered in real-world scenarios, we employ actual noise samples recorded in daily life instead of synthetic white Gaussian noise. We integrate noise signals from the DEMAND [63] and BUT [61] noise datasets, both of which were collected in real-world daily life environments, and resample them to $16$ kHz.

We use the LibriSpeech corpus [64] for sampling the sound source speech signals when synthesizing the observed reverberant signals. Specifically, we select a 360-hour clean subset. This subset is composed of more than 100,000 unique clips articulated by 921 speakers with completely distinct linguistic contents. The use of this dataset ensures a broad spectrum of diverse speech signals, enhancing the robustness and generalizability of our synthesized signals.

III-C Data Preparation

III-C1 Synthesis of Noisy Reverberant Speech Signals

The composite RIR dataset carries detailed annotations of $T_{i}$ and $T_{d}$ (the parameters of the SSIR RIR model), the room volume $V$ , the sound source distance $D$ , and the DOA { $\theta$ , $\psi$ } of the sound source. We further apply a data augmentation strategy to this dataset. The strategy involves data upsampling and downsampling to modulate the distribution of the labels, which mitigates potential biases and yields more natural distributions. The degrees of upsampling and downsampling are calibrated based on the relative rarity of the values of each label. After the data augmentation process is applied, a collection of $47,430$ realistic RIRs is compiled. This RIR dataset contains a wide range of RIRs, for which the corresponding $T_{60}$ spans from $0.18$ to $8.00$ sec.

Then, we randomly sample 47,430 clips from the LibriSpeech corpus, choosing clips with the most common length (from $12$ to $17$ sec.) as the sound source speech signals, regardless of their speaker information and linguistic content. In parallel, noise signals are randomly sampled, following an independent and identically distributed (I.I.D.) pattern, from the DEMAND and BUT datasets. Then, in accordance with Eq. (12), we synthesize the noisy reverberant speech signals. To enhance the robustness and efficacy of the model across diverse noisy environments, the signal-to-noise ratio (SNR) between the reverberant and noise signals is uniformly varied over five levels from 0 dB to 20 dB in 5 dB increments, plus a noise-free condition (Inf). Given the uniqueness of each clip, every synthesized speech signal is distinct in terms of both its waveform and linguistic content, which further increases the diversity and richness of the synthesized dataset.
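The SNR control described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `mix_at_snr`, the use of plain Python lists, and the treatment of the Inf condition as "add no noise" are assumptions, and the reverberant signal is assumed to have been obtained already by convolving speech with an RIR.

```python
import math

# SNR conditions quoted in the text: 0 to 20 dB in 5 dB steps, plus noise-free.
SNR_LEVELS_DB = [0.0, 5.0, 10.0, 15.0, 20.0, math.inf]

def mix_at_snr(reverberant, noise, snr_db):
    """Scale `noise` so the reverberant-to-noise power ratio equals `snr_db`, then add it."""
    if len(reverberant) != len(noise):
        raise ValueError("signals must have the same length")
    if math.isinf(snr_db):
        return list(reverberant)  # noise-free condition (assumed meaning of Inf)
    p_sig = sum(s * s for s in reverberant) / len(reverberant)
    p_noise = sum(n * n for n in noise) / len(noise)
    if p_noise == 0.0:
        raise ValueError("noise has zero power")
    # Gain g satisfies p_sig / (g^2 * p_noise) = 10^(snr_db / 10).
    gain = math.sqrt(p_sig / (p_noise * 10.0 ** (snr_db / 10.0)))
    return [s + gain * n for s, n in zip(reverberant, noise)]
```

Drawing `snr_db` uniformly from `SNR_LEVELS_DB` for each clip would reproduce the uniform variation over conditions that the text describes.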

III-C2 Synthesis of Crowded Reverberant Speech Signals

Initially, we apply voice activity detection to the LibriSpeech corpus to segment and annotate the timestamps corresponding to speech and silence segments. This process underlies the annotations of the synthesized crowded reverberant signals. Then, the gamma distribution of the occupancy levels is modeled. Explicitly, in rooms of varying volumes, the occupancy level $N$ follows a Gaussian distribution $N\sim\textit{{N}}(N,1)$ , consistent with the real-world principle that larger spaces typically accommodate more occupants, while smaller spaces accommodate fewer occupants.

The Gaussian mixture distribution is used to approximate the gamma distribution of the occupancy level, as detailed in Eq. (13). Furthermore, the distribution of the distance $d_{i}$ from the $i$ -th occupant speaker to the listener is governed by a Gaussian distribution $d_{i}\sim\textit{{N}}(\mu_{d},1)$ , where $\mu_{d}$ represents the mean of the maximum and minimum distances. In accordance with Assumption 1 (Sec. III-A2), $\mu_{d}$ is set to 2.5.

Using Eq. (13), we synthesize the crowded reverberant speech signals by superimposing the speech signals uniformly sampled from the LibriSpeech corpus, aligning with the annotated speech and silence segmentation. The initiation index for each overlapping speech signal is determined based on an I.I.D. pattern. Additionally, to authentically replicate the local room acoustics, the room volumes in the RIRs are precisely matched with their corresponding RIRs, thereby ensuring a realistic acoustic environment. Finally, we obtain a dataset comprising 47,430 samples of crowded reverberant speech signals, ranging from 10 to 25 sec. Fig. 6 shows an example of a crowded reverberant speech signal and its corresponding occupancy level according to Eq. (13) and the aforementioned synthesis strategy.

III-D Estimation Framework Architecture

III-D1 Featurization

We use three types of featurization methods to represent the observed input signals: the Gammatonegram, the MFCC, and the mel spectrogram. The Gammatonegram emphasizes the low-frequency components that are important while a signal propagates within a room [65, 34]. The MFCC characterizes the shape of the spectral envelope of a reverberant signal, which is closely related to the MTF of room acoustics [58]. The mel spectrogram, in contrast, mimics human subjective perception of the RACs.

III-D2 Room Feature Representation Learning

We use a room feature encoder (RFE) to learn room feature representations.

Room Feature Encoder. The RFE is structured into eight blocks, each block comprising four components. It incorporates a half-residual feedforward network, a multiheaded self attention, a convolutional network, and another half-residual feedforward network [66].

This encoder integrates CNN and transformer models, which capture local and global acoustic features, respectively, since Wang et al. [28] showed that the acoustical information is spread across all frequency components of the reverberant signal. Such integration makes it particularly well suited for learning the sophisticated mappings between the noisy and crowded reverberant speech signals with complex waveforms and the desired room parameters.

The signal flow from the input feature representation $\mathbcal{x}_{i}$ to the latent variable output $\mathbcal{y}_{i}$ across each block is mathematically expressed as:

\mathbcal{x}_{i}^{\ddagger}=\mathbcal{x}_{i}+\frac{1}{2}\cdot\textbf{FFN}(\mathbcal{x}_{i}),

(17)

\mathbcal{x}_{i}^{\ddagger\ddagger}=\mathbcal{x}_{i}^{\ddagger}+\textbf{LayerNorm}[\textbf{MHSA}(\mathbcal{x}_{i}^{\ddagger})],

(18)

\mathbcal{x}_{i}^{\ddagger\ddagger\ddagger}=\mathbcal{x}_{i}^{\ddagger\ddagger}+\textbf{Conv}(\mathbcal{x}_{i}^{\ddagger\ddagger}),

(19)

\mathbcal{y}_{i}=\textbf{LayerNorm}\Big[\mathbcal{x}_{i}^{\ddagger\ddagger\ddagger}+\frac{1}{2}\cdot\textbf{FFN}(\mathbcal{x}_{i}^{\ddagger\ddagger\ddagger})\Big],

(20)

where FFN denotes the feedforward network, MHSA denotes the multiheaded self attention, Conv denotes the convolutional network, and LayerNorm represents the layer normalization operation.
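The residual structure of Eqs. (17) to (20) can be traced with a small sketch. It is only an illustration of the signal flow for one feature vector: the sublayers are passed in as placeholder callables (hypothetical stand-ins for the trained FFN, MHSA, and Conv modules), and the layer normalization here has no learnable scale or shift.

```python
def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learnable parameters)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / (var + eps) ** 0.5 for v in x]

def rfe_block(x, ffn_in, mhsa, conv, ffn_out):
    """Apply one encoder block; each sublayer maps a vector to a vector of equal length."""
    x1 = [a + 0.5 * b for a, b in zip(x, ffn_in(x))]          # Eq. (17): half-residual FFN
    x2 = [a + b for a, b in zip(x1, layer_norm(mhsa(x1)))]    # Eq. (18): attention branch
    x3 = [a + b for a, b in zip(x2, conv(x2))]                # Eq. (19): convolution branch
    return layer_norm([a + 0.5 * b for a, b in zip(x3, ffn_out(x3))])  # Eq. (20)
```

Stacking eight such calls, each with its own sublayers, would mirror the eight-block encoder described above.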

Feedforward network. The feedforward network is composed of a layer normalization layer, linear layers with 2048 hidden and 512 embedding dimensions with the swish activation function [67], and a dropout layer with a dropout rate of $0.1$ . Fig. 7 visualizes the architecture of this module.

Multiheaded self attention. The multiheaded self attention with extrapolatable relative positional encoding (xPos) enhances the ability of the model to grasp the global acoustical information encapsulated in the feature representations [68]. Fig. 8 shows the corresponding architecture. The xPos encoding strategy has been empirically validated to improve the stability and robustness of the self attention mechanism, particularly for sequences of varying length.

The xPos-based relative self attention can be formulated as follows:

\textbf{RelAttn}(\mathbcal{x})=\text{softmax}\Bigg(\frac{\boldsymbol{Q}_{\mathbcal{x},\rm{xPos}}\boldsymbol{K}_{\mathbcal{x},\rm{xPos}}^{T}}{\sqrt{\mathbcal{D}_{h}}}\boldsymbol{M}\Bigg)\boldsymbol{V}_{\mathbcal{x}}

(21)

where $\boldsymbol{Q}_{\mathbcal{x},\rm{xPos}}=(\textbf{W}_{q}\mathbcal{C}+\mathfrak{R}_{\boldsymbol{Q}}\mathbcal{S})\mathbcal{T}$ , $\boldsymbol{K}_{\mathbcal{x},\rm{xPos}}=(\textbf{W}_{k}\mathbcal{C}+\mathfrak{R}_{\boldsymbol{K}}\mathbcal{S})\mathbcal{T}^{-1}$ , and $\boldsymbol{V}_{\mathbcal{x}}=\textbf{W}_{v}\mathbcal{x}$ . $\mathbcal{C}$ is equal to $\cos(m\vartheta_{i})$ and $\mathbcal{S}$ is equal to $\sin(m\vartheta_{i})$ , which are the cosine and sine positions at the embedding dimension $i$ and the time slice $m$ , respectively. $\mathfrak{R}$ corresponds to the rotary matrix of $\boldsymbol{Q}$ and $\boldsymbol{K}$ . $\mathbcal{D}_{h}$ is the head dimension of the attention mechanism. “T” denotes transposition. $\mathbcal{T}=\varsigma_{m,i}$ . $\varsigma_{i}$ is given by:

\varsigma_{i}=\frac{i/\frac{\mathbcal{D}_{h}}{2}+\beta}{1+\beta}.

(22)

where $\beta$ is the optimal setting and $\vartheta_{i}=10000^{-2i/\mathbcal{D}_{h}}$ . $\textbf{W}_{q}$ , $\textbf{W}_{k}$ , $\textbf{W}_{v}$ and $\boldsymbol{M}$ are trainable weighting matrices of query, key, and value of the attention mechanism, respectively.
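For concreteness, the two per-dimension quantities in Eq. (22) and the definition of $\vartheta_{i}$ can be evaluated directly. The sketch below is illustrative only: the function names are hypothetical, and since the text does not state the value of $\beta$ , the value `0.4` in the usage lines is a placeholder assumption.

```python
def rotary_frequency(i, head_dim):
    """vartheta_i = 10000^(-2i / D_h) for embedding dimension index i."""
    return 10000.0 ** (-2.0 * i / head_dim)

def xpos_scale(i, head_dim, beta):
    """varsigma_i = (i / (D_h / 2) + beta) / (1 + beta), as in Eq. (22)."""
    return (i / (head_dim / 2.0) + beta) / (1.0 + beta)

# Example with an assumed head dimension of 64 and a placeholder beta.
head_dim, beta = 64, 0.4
scales = [xpos_scale(i, head_dim, beta) for i in range(head_dim // 2 + 1)]
```

By construction, `xpos_scale` increases linearly with `i` from `beta / (1 + beta)` at `i = 0` to `1` at `i = head_dim / 2`, so every scale lies in the interval (0, 1] for positive `beta`.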

Convolutional network. The convolutional network captures the local acoustic features and reinforces the temporal causality of the feature representation. This module leverages a prenorm residual connection with gating mechanisms to distill the important acoustical characteristics via pointwise and depthwise convolutional layers and a gated linear unit (GLU) layer [69], as illustrated in Fig. 9.

III-D3 Regression Estimation of the Room Parameters

For the regression task of room parameter estimation, we employ a parametric predictor (PP) solely for $T_{i}$ , $T_{d}$ , $V$ , $D$ , and utilize both PP and an acoustical bias corrector (ABC) for $\theta$ , and $\psi$ .

Parametric Predictor. The PP employs several convolutional layers with ReLU activation functions to compose a nonlinear regression function, allowing us to utilize the encoded representations within the latent space to predict the physically meaningful room parameters. These parameters include the two parameters of the SSIR model $\hat{T}_{i}$ and $\hat{T}_{d}$ , the room volume $\hat{V}$ , the sound source distance $\hat{D}$ , and the DOA of the sound source $\hat{\theta}$ and $\hat{\psi}$ . The behavior of the predictor can be expressed as follows:

\boldsymbol{\gamma}=f_{\rm{pred}}(\mathbcal{y}_{enc})

(23)

where $\boldsymbol{\gamma}$ denotes the room parameters output from the PP, which is a constant function along the time axis. During inference, we average the values over the time axis to obtain a single predicted room parameter. The overall architecture of the PP is presented in Fig. 10.

Acoustical Bias Corrector. The ABC acts as a gating mechanism to differentiate between the biased and unbiased data encoded within the latent space, thereby directing the optimal signal flow into the PP and ensuring that the PP can learn from the unbiased data distribution. Additionally, we leak some biased data into the PP to make it robust to biases. The necessity of such a mechanism arises in the context of room parameters such as the sound source azimuth $\theta$ and elevation $\psi$ , whose distributions exhibit the substantial inherent biases that are difficult to mitigate through conventional data augmentation techniques, as shown in Fig. 11. The biases often lead to the regression of trivial results, i.e., the mean of the whole distribution.

The ABC has a sandwich structure that uses rotary-positional self attention [70] as a feature enhancer to assign different attention weights to all latent spectro-temporal feature representations in a frame-by-frame manner. The rotary-position encoding approach also contributes to stabilizing the training process. It is also adaptable to variable latent input lengths without the need for alignment. The output bias probability $p_{\rm{abc}}(\mathbcal{\hat{y}})$ of the ABC can be mathematically expressed as follows:

p_{\rm{abc}}(\mathbcal{\hat{y}})=f_{\rm{corr}}(\mathbcal{y}_{enc}).

(24)

Fig. 12 depicts the entire architecture of the ABC. “GELU” and “Sigmoid” denote the GELU [71] and sigmoid activation functions, respectively.

III-D4 Classification Estimation of the Room Parameters

An FC layer is designed to map the encoded feature representations to physics-informed instantaneous occupancy levels as a time sequence derived from the observed crowded signals. This design is substantiated by Assumption 2 (Section III-A2), which allows the time-series forecasting problem to be recast from a regression task into a classification task, thus reducing the complexity of the task while improving the robustness of the prediction. Considering that the occupancy level exhibits no significant temporal dependence, we simply adopt a linear layer with a log-softmax activation function to predict the occupancy level rather than recurrent or autoregressive structures. The resolution of the estimation process is about $62.5$ Hz, i.e., $16$ ms per frame, for predicting the occupancy level around the listener’s location.

III-E Joint Estimation Framework

We explore a joint estimation framework for estimating multiple RAPs and RPPs simultaneously. The underlying hypothesis posits that the RAPs, which are directly derived from the RIR, share the same reverberation information encapsulated in the RIR. Additionally, the observed reverberant signal embodies crucial physical information related to volume and sound-source characteristics. Furthermore, the interdependency between the RAPs and RPPs plays a pivotal role in improving the robustness and efficacy of the estimation, so the accuracy of the joint estimation strategy is anticipated to be at least similar to that of the separate strategy. These hypotheses substantiate the feasibility of developing a universal model, which is a promising approach for efficiently analyzing the room acoustics within a unified estimation framework instead of training multiple separate models.

The architecture of the joint estimation method is illustrated in Fig. 1. The unified and occupancy modules originate from the noisy reverberant signal $y_{\rm{nr}}$ and the crowded reverberant signal $y_{\rm{cr}}$ , respectively. Within the joint framework, the unified RFE serves as the foundational component across all PPs, facilitating the mutual exchange of interdependent information among the RAPs and RPPs in the latent space. Subsequently, each targeted room parameter is tasked with regressing a distinct function by using a dedicated predictor for desired room parameters. The configuration of each PP, as well as that of the ABC, is described in Section III-D3.

III-F Loss Function

III-F1 Loss for the Parametric Predictor

We employ the Huber loss [72] to optimize the PPs across each targeted room parameter. The Huber loss of the PPs is defined as:

\mathcal{L}_{\rm{pred}}(\gamma,\hat{\gamma})=\begin{cases}\frac{1}{2\mathcal{N}}\sum_{n=1}^{\mathcal{N}}\sum_{k=1}^{\mathcal{K}}(\gamma_{n}-\hat{\gamma}_{n,k})^{2}, & \lvert\gamma_{n}-\hat{\gamma}_{n,k}\rvert\leq\delta\\ \delta\frac{1}{\mathcal{N}}\sum_{n=1}^{\mathcal{N}}\sum_{k=1}^{\mathcal{K}}\lvert\gamma_{n}-\hat{\gamma}_{n,k}\rvert-\frac{1}{2}\delta^{2}, & \lvert\gamma_{n}-\hat{\gamma}_{n,k}\rvert>\delta\end{cases}

(25)

where $\gamma$ is the targeted room parameter and $\delta$ is set to 1. The symbol $\mathcal{N}$ denotes the batch size, and $\mathcal{K}$ represents the time frame length. The Huber loss combines the sensitivity of the minimum-variance estimation by the $\mathcal{L}2$ loss with the robustness of the median-aware estimation against outliers by the $\mathcal{L}1$ loss. It also circumvents the convergence problem of the $\mathcal{L}1$ loss on a small scale [73] and contributes to preventing exploding gradients by clipping gradients exceeding $\delta$ .
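A minimal sketch of this loss is given below. It applies the two branches per frame (the standard element-wise Huber form), sums over the frames of each clip, and divides by the batch size, which matches the quadratic branch of Eq. (25); the function names and the list-based data layout are assumptions made for illustration.

```python
def huber(target, estimate, delta=1.0):
    """Element-wise Huber penalty: quadratic near zero, linear beyond delta."""
    err = abs(target - estimate)
    if err <= delta:
        return 0.5 * err * err
    return delta * err - 0.5 * delta * delta

def huber_loss(targets, frame_estimates, delta=1.0):
    """targets: one label per clip; frame_estimates: per-clip lists of per-frame outputs."""
    total = 0.0
    for gamma, frames in zip(targets, frame_estimates):
        total += sum(huber(gamma, g_hat, delta) for g_hat in frames)
    return total / len(targets)  # normalize by the batch size N
```

The slope of `huber` with respect to the error never exceeds `delta` in magnitude, which is the gradient-clipping property mentioned above.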

III-F2 Loss for the Acoustical Bias Corrector

Considering the prediction task of the ABC is binary, i.e., distinguishing between unbiased and biased data, we adopt the binary cross-entropy (BCE) for optimization, which is defined as follows:

\mathcal{L}_{\rm{corr}}=-\frac{1}{\mathcal{N}}\sum_{n=1}^{\mathcal{N}}\Big[y^{\rm{abc}}_{n}\cdot\log\big(p_{\rm{abc}}(\hat{y}_{n})\big)+(1-y^{\rm{abc}}_{n})\cdot\log\big(1-p_{\rm{abc}}(\hat{y}_{n})\big)\Big],

(26)

where $y^{\rm{abc}}_{n}$ denotes the ground-truth label presenting acoustical bias, and $p_{\rm{abc}}(\hat{y}_{n})$ denotes the predicted bias probability output from the ABC.
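The BCE objective can be written out in a few lines. This is a sketch under stated assumptions: the function name is hypothetical, labels are taken to be 0 or 1, and the clamping constant `eps` is a numerical safeguard added here that the text does not mention.

```python
import math

def bce_loss(labels, probs, eps=1e-12):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite (added safeguard)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(labels)
```

A maximally uncertain output of `0.5` for every sample gives a loss of `log(2)`, which is a convenient reference level when monitoring training.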

III-F3 Loss for the Occupancy Module

The occupancy module utilizes the cross-entropy (CE), reflecting the multiclass nature of the occupancy level estimation task. This loss function is determined as follows:

\mathcal{L}_{\rm{occu}}=-\frac{1}{\mathcal{N}}\sum_{n=1}^{\mathcal{N}}\sum_{c=0}^{\mathcal{C}}N_{n,c}\log\big(p_{\rm{crowd}}(\hat{y}^{\rm{crowd}}_{n,c})\big),

(27)

where $\mathcal{C}$ , the upper bound of the occupancy level, is set to 12 according to Assumption 2 (detailed in Section III-A2). Here, $\hat{y}^{\rm{crowd}}_{n,c}$ represents the logits output from the FC, $p_{\rm{crowd}}$ denotes the output probability after the softmax function, and $N_{n,c}$ denotes the ground-truth instantaneous occupancy level.
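As a rough illustration of Eq. (27), the sketch below computes the cross-entropy from raw logits through a log-softmax. It assumes that the ground truth is supplied as an integer class index per frame (equivalent to a one-hot $N_{n,c}$ ) and that there are 13 classes, i.e., occupancy levels 0 to 12; the function names are hypothetical.

```python
import math

NUM_LEVELS = 13  # occupancy levels 0..12, given the upper bound of 12

def log_softmax(logits):
    """Numerically stable log-softmax over one frame's logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]

def occupancy_ce(true_levels, logits_per_frame):
    """Mean negative log-probability assigned to the true occupancy level."""
    total = 0.0
    for level, logits in zip(true_levels, logits_per_frame):
        total -= log_softmax(logits)[level]
    return total / len(true_levels)
```

With uniform logits the loss equals `log(13)`, the value expected from a predictor that has learned nothing about the occupancy level.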

III-F4 See-Saw Loss

When estimating the sound source azimuth and elevation, the ABC is deployed to counteract the significant bias inherent in the data distribution. Nevertheless, a critical issue arises from the disparate gradient descent rates of the two employed loss functions (the Huber and BCE losses). The gradient descent rate for the BCE loss is significantly faster than that for the Huber loss, which causes training instability, specifically when the BCE loss approaches overfitting while the Huber loss is still underfitting.

Therefore, we introduce a new loss function, namely, the see-saw loss, to solve this disparity. This loss function can adaptively balance the gradient descent rates of BCE and Huber losses, thus stabilizing the training process. The see-saw loss function devised for DOA estimation is formulated as follows:

\mathcal{L}_{\rm{see-saw}}(\theta;\psi,\hat{\theta};\hat{\psi})=\mathfrak{w}_{\rm{corr}}^{\prime}(\mathcal{L}^{\rm{az}}_{\rm{corr}}+\mathcal{L}^{\rm{elev}}_{\rm{corr}})+\frac{\mathfrak{w}_{\rm{pred}}[\mathfrak{w}^{\prime}_{\rm{pred}}\mathcal{L}_{\rm{pred}}(\theta,\hat{\theta})+\mathfrak{w}^{\prime\prime}_{\rm{pred}}\mathcal{L}_{\rm{pred}}(\psi,\hat{\psi})]}{1+\mathfrak{w}^{\prime\prime}_{\rm{corr}}(\mathcal{L}^{\rm{az}}_{\rm{corr}}+\mathcal{L}^{\rm{elev}}_{\rm{corr}})},

(28)

where $\mathcal{L}_{\rm{see-saw}}(\theta;\psi,\hat{\theta};\hat{\psi})$ denotes the total loss. The components $\mathcal{L}^{\rm{az}}_{\rm{corr}}$ and $\mathcal{L}^{\rm{elev}}_{\rm{corr}}$ represent the BCE losses of the azimuth and elevation through the ABC, respectively. $\mathcal{L}_{\rm{pred}}(\theta,\hat{\theta})$ and $\mathcal{L}_{\rm{pred}}(\psi,\hat{\psi})$ correspond to the Huber losses for the azimuth and elevation using PPs. $\mathfrak{w}^{\prime}_{\rm{corr}}$ and $\mathfrak{w}^{\prime\prime}_{\rm{corr}}$ are the weight coefficients of the bias correctors. $\mathfrak{w}_{\rm{pred}}$ , $\mathfrak{w}^{\prime}_{\rm{pred}}$ , and $\mathfrak{w}^{\prime\prime}_{\rm{pred}}$ are the corresponding weight coefficients of the predictors.
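The balancing behavior of Eq. (28) is easiest to see numerically. The following sketch combines already-computed scalar loss values; the function and argument names are hypothetical, and the default weights reuse the ratio reported with Eq. (29) as absolute values, which is an assumption.

```python
def see_saw_loss(bce_az, bce_el, huber_az, huber_el,
                 w_corr1=0.1, w_corr2=0.1,
                 w_pred=0.5, w_pred_az=10.0, w_pred_el=1.0):
    """Combine BCE (corrector) and Huber (predictor) losses as in Eq. (28)."""
    corr = bce_az + bce_el
    pred = w_pred * (w_pred_az * huber_az + w_pred_el * huber_el)
    # The Huber term is attenuated while the BCE term is still large.
    return w_corr1 * corr + pred / (1.0 + w_corr2 * corr)
```

When both BCE terms are zero the Huber terms pass through at full weight, and as the BCE terms grow the denominator scales the Huber contribution down, which is the see-saw effect the text describes.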

III-F5 Polynomial See-Saw Loss for Joint Estimation

We introduce a loss function that combines polynomial losses with see-saw loss for the joint estimation framework.

The polynomial see-saw loss $\mathcal{L}_{\rm{unified}}$ is formulated as follows:

\mathcal{L}_{\rm{unified}}(T_{i};T_{d};V;D;\theta;\psi,\hat{T}_{i};\hat{T}_{d};\hat{V};\hat{D};\hat{\theta};\hat{\psi})=\mathfrak{w}_{T_{i}}\mathcal{L}_{\rm{pred}}(T_{i},\hat{T}_{i})+\mathfrak{w}_{T_{d}}\mathcal{L}_{\rm{pred}}(T_{d},\hat{T}_{d})+\mathfrak{w}_{V}\mathcal{L}_{\rm{pred}}(V,\hat{V})+\mathfrak{w}_{D}\mathcal{L}_{\rm{pred}}(D,\hat{D})+\mathcal{L}_{\rm{see-saw}}(\theta;\psi,\hat{\theta};\hat{\psi}),

(29)

where $\mathfrak{w}_{T_{i}}$ , $\mathfrak{w}_{T_{d}}$ , $\mathfrak{w}_{V}$ , and $\mathfrak{w}_{D}$ are weighting coefficients for losses $\mathcal{L}_{\rm{pred}}(T_{i},\hat{T}_{i})$ , $\mathcal{L}_{\rm{pred}}(T_{d},\hat{T}_{d})$ , $\mathcal{L}_{\rm{pred}}(V,\hat{V})$ , and $\mathcal{L}_{\rm{pred}}(D,\hat{D})$ , respectively. The weighting ratio is arranged as: $\mathfrak{w}_{T_{i}}:\mathfrak{w}_{T_{d}}:\mathfrak{w}_{V}:\mathfrak{w}_{D}:\mathfrak{w}^{\prime}_{\rm{corr}}:\mathfrak{w}^{\prime\prime}_{\rm{corr}}:\mathfrak{w}_{\rm{pred}}:\mathfrak{w}^{\prime}_{\rm{pred}}:\mathfrak{w}^{\prime\prime}_{\rm{pred}}=5.0:5.0:5.0:5.0:0.1:0.1:0.5:10.0:1.0$ .

III-G Evaluation Metrics

We employ the mean absolute error (MAE) and the Pearson correlation coefficient (PCC) as the evaluation metrics. The MAE provides a direct measure of the scale of the average estimation error and the PCC is introduced to quantify the invariant similarity of the estimated and ground-truth values.
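Both metrics follow their textbook definitions, sketched below for reference. The function names are illustrative, and returning NaN when one series is constant is a convention chosen here, since the PCC is undefined in that case.

```python
import math

def mae(targets, estimates):
    """Mean absolute error between ground-truth and estimated values."""
    return sum(abs(t - e) for t, e in zip(targets, estimates)) / len(targets)

def pcc(targets, estimates):
    """Pearson correlation coefficient; NaN if either series has zero variance."""
    n = len(targets)
    mean_t = sum(targets) / n
    mean_e = sum(estimates) / n
    cov = sum((t - mean_t) * (e - mean_e) for t, e in zip(targets, estimates))
    var_t = sum((t - mean_t) ** 2 for t in targets)
    var_e = sum((e - mean_e) ** 2 for e in estimates)
    if var_t == 0.0 or var_e == 0.0:
        return math.nan
    return cov / math.sqrt(var_t * var_e)
```

The two metrics are complementary: an estimator that always outputs the dataset mean can reach a modest MAE on a narrow label distribution, yet its PCC is undefined or near zero, so the PCC exposes this kind of trivial regression.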

TABLE I: MAE and PCC comparisons among variants of the proposed BERPs with different featurizations and baselines for the room parameters derived from the estimation frameworks, i.e., the neural network models. All models were sufficiently trained until convergence. Gammatone, Mel, and MFCC denote the featurization methods.

		$T_{i}$	$T_{d}$	$V$	$D$	$\theta$	$\psi$	$N$
	(joint)	[ $s$ ]	[ $s$ ]	[ $\log_{10}(m^{3})$ ]	[ $m$ ]	[rad]	[rad]
MAE $\downarrow$	Full-CNN[32, 33, 34]	1.7720	1.0050	1.3210	3.8260	3.2610	0.5966	-
	CRNN[35, 36]	0.0238	0.2968	0.3189	1.8730	0.3685	0.1002	-
	TAE-CNN[39]	0.0404	0.4608	0.4328	3.8700	0.3157	0.0684	-
	RE-NET[38]	0.1844	1.1460	0.5102	4.6180	0.8614	0.8183	-
	BERP-Gammatone	0.0030	0.0341	0.0373	0.6918	0.2451	0.0684	0.5519
	BERP-Mel	0.0018	0.0221	0.0272	0.5070	0.1967	0.0695	0.5370
	BERP-MFCC	0.0019	0.0264	0.0271	0.5375	0.1899	0.0734	0.5411
PCC $\uparrow$	Full-CNN[32, 33, 34]	0.1543	0.6431	0.3268	0.5731	0.0329	0.0116	-
	CRNN[35, 36]	0.6914	0.9356	0.7450	0.8512	0.1157	0.1756	-
	TAE-CNN[39]	-	-	0.5555	0.5194	-	-	-
	RE-NET[38]	0.2395	0.1579	0.3961	0.3285	0.0629	0.1579	-
	BERP-Gammatone	0.9437	0.9929	0.9731	0.9271	0.6311	0.6936	-
	BERP-Mel	0.9691	0.9971	0.9705	0.9503	0.7017	0.7342	-
	BERP-MFCC	0.9667	0.9951	0.9733	0.9520	0.7325	0.7342	-

TABLE II: Evaluation results obtained for the RAPs derived from the SSIR RIR model of the proposed BERP.

		STI	$\%\rm{AL}_{cons}$	$T_{60}$	EDT	$C_{80}$	$C_{50}$	$D_{50}$	$T_{s}$
	(joint)		[ $\%$ ]	[ $s$ ]	[ $s$ ]	[dB]	[dB]	[ $\%$ ]	[ $s$ ]
MAE $\downarrow$	BERP-Gammatone	0.0544	4.1388	0.0342	0.3378	2.966	3.3556	14.7659	0.0539
	BERP-Mel	0.0534	4.0794	0.0221	0.3282	2.9051	3.3135	14.5699	0.0528
	BERP-MFCC	0.0540	4.0877	0.0265	0.3325	2.9498	3.3418	14.6950	0.0532
PCC $\uparrow$	BERP-Gammatone	0.9477	0.8660	0.9976	0.9870	0.9047	0.8370	0.8221	0.9772
	BERP-Mel	0.9501	0.8682	0.9994	0.9892	0.9097	0.8412	0.8263	0.9802
	BERP-MFCC	0.9490	0.8671	0.9960	0.9864	0.9082	0.8397	0.8251	0.9777

TABLE III: Results of an ablation study concerning separate estimation pipelines. The MAE and PCC attained by the proposed BERP and the baselines in separate estimation pipelines are presented. All the models were sufficiently trained until convergence. We used the MFCC featurization method for the BERP.

		$T_{i}$	$T_{d}$	$V$	$D$	$\theta$	$\psi$
	(separate)	[ $s$ ]	[ $s$ ]	[ $\log_{10}(m^{3})$ ]	[ $m$ ]	[rad]	[rad]
MAE $\downarrow$	Full-CNN[32, 33, 34]	0.0704	0.3085	0.5282	4.8520	0.3139	0.0702
	CRNN[35, 36]	0.0177	0.2927	0.1597	1.6540	0.2583	0.0701
	TAE-CNN[39]	0.0341	1.7320	3.1200	7.704	0.3157	0.0683
	RE-NET[38]	0.0341	0.6283	0.4963	5.2390	0.3140	0.0733
	BERP	0.0025	0.0322	0.0382	0.6413	0.2074	0.0569
PCC $\uparrow$	Full-CNN[32, 33, 34]	0.2660	0.9377	-	-	-	-
	CRNN[35, 36]	0.5202	0.9481	0.9221	0.8859	0.3397	0.2612
	TAE-CNN[39]	-	-	-	-	-	-
	RE-NET[38]	0.1159	0.6722	0.3293	0.1243	0.0341	0.0296
	BERP	0.9597	0.9976	0.9641	0.9336	0.6173	0.6595

When estimating the occupancy levels as a time sequence, we choose only the MAE to quantify the Euclidean distance between the estimated and ground-truth occupancy sequences.

IV Experiments

IV-A Experimental Setup

Training strategy. We randomly split the $47,430$ distinct clips of noisy and crowded reverberant speech signals into training, validation, and test datasets, following the I.I.D. paradigm. We allocate 2000 clips each to the validation and test datasets, and the remaining 43,430 clips form the training dataset. A padding mask is deployed to ensure that the framework learns only the valid information across each minibatch. The RAdam optimizer with $\mathcal{L}2$ regularization is used [74], which provides learning rate warmup without the risk of underfitting the regression tasks. We utilize cosine-annealing and tri-stage learning rate schedulers for the unified and occupancy modules, respectively, to facilitate the convergence of the models toward the global optima. We set a batch size of $12$ . Given the wide range of room volumes spanning from $40$ to $9000$ $m^{3}$ , we apply logarithmic scaling to compress them, stabilizing the training process and improving model robustness. Unitary linear normalization is applied to standardize the gradient update rate to ensure a uniform descent across labels.
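The label scaling for the room volume can be sketched as below. The logarithmic compression follows the text; reading "unitary linear normalization" as min-max scaling to the interval [0, 1] is an assumption, as are the function names.

```python
import math

V_MIN, V_MAX = 40.0, 9000.0  # room volume range in m^3 quoted in the text

def scale_volume(volume_m3):
    """log10-compress a volume, then map it linearly to [0, 1] (assumed normalization)."""
    lo, hi = math.log10(V_MIN), math.log10(V_MAX)
    return (math.log10(volume_m3) - lo) / (hi - lo)

def unscale_volume(scaled):
    """Invert scale_volume to recover a volume in m^3."""
    lo, hi = math.log10(V_MIN), math.log10(V_MAX)
    return 10.0 ** (scaled * (hi - lo) + lo)
```

In the log domain the range shrinks from a factor of 225 between the smallest and largest rooms to roughly 1.6 to 3.95 in $\log_{10}(m^{3})$ , which keeps the regression targets on a scale comparable to that of the other labels.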

TABLE IV: Results of an ablation study concerning disentangling PP. The PP is replaced with a simple linear layer to investigate the contribution of the PP. The MFCC featurization is employed across all evaluations.

	Architecture	$T_{i}$	$T_{d}$	$V$	$D$
	(separate)
MAE $\downarrow$	BERP w/o PP	0.1620	1.0023	0.0508	0.5960
MAE $\downarrow$	BERP	0.0025	0.0322	0.0382	0.6413
PCC $\uparrow$	BERP w/o PP	0.6013	0.9211	0.9554	0.9343
PCC $\uparrow$	BERP	0.9579	0.9948	0.9641	0.9336
	(joint)
MAE $\downarrow$	BERP w/o PP	0.1174	1.1707	0.4173	8.4852
MAE $\downarrow$	BERP	0.0019	0.0264	0.0271	0.5375
PCC $\uparrow$	BERP w/o PP	0.6219	0.7821	0.7626	0.6724
PCC $\uparrow$	BERP	0.9667	0.9951	0.9733	0.9520

Featurizer configuration. We set a uniform configuration for all spectrogram-variant featurizers. Each uses 128 channels (Gammatone filters, mel filterbank channels, or DCT bins, respectively), a window size of $1024$ , and a $75\%$ overlap rate.
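These settings are consistent with the frame resolution of about $62.5$ Hz ( $16$ ms per frame) quoted in Section III-D4, as the short calculation below shows. It assumes the $16$ kHz sampling rate used for the datasets and no further temporal subsampling after the featurizer; the function name is illustrative.

```python
def frame_timing(sample_rate_hz=16000, window=1024, overlap=0.75):
    """Return (hop in samples, frame period in ms, frame rate in Hz)."""
    hop = int(round(window * (1.0 - overlap)))   # 1024 * 0.25 = 256 samples
    frame_period_ms = 1000.0 * hop / sample_rate_hz
    frame_rate_hz = sample_rate_hz / hop
    return hop, frame_period_ms, frame_rate_hz

hop, period_ms, rate_hz = frame_timing()
print(hop, period_ms, rate_hz)  # 256 16.0 62.5
```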

Baselines. In our comparative experiments, we evaluated the performance of our proposed method in comparison with four baseline architectures that are renowned in the domain of room parameter estimation amidst background noise: the Full-CNN [32, 33, 34], the CRNN [35, 36], the TAE-CNN [39], and the RE-NET [38]. These SOTA frameworks were deployed in both joint and separate estimation tasks.

IV-B Results

IV-B1 Evaluation of the Room Parameters Derived from Frameworks

We evaluated the proposed BERP and the baseline frameworks by using the same dataset as detailed in Section III-C1 and the same data segmentation setting for the joint estimation of the room parameters $T_{i}$ , $T_{d}$ , $V$ , $D$ , $\theta$ , $\psi$ and $N$ output from the trained models.

Table I shows the estimation accuracies achieved by the BERP across three featurizations, alongside a comparison with the baselines. The BERP significantly outperforms the SOTA architectures in terms of the MAE and PCC evaluation metrics. Even for parameters such as the azimuth $\theta$ and elevation $\psi$ , which are subject to significant data distribution biases, the BERP maintains its effectiveness. Moreover, the performance comparison among the three featurizers indicates that the MFCC featurizer yields the most favorable outcomes, which supports our assertion regarding the intrinsic relevance of the MFCC to room acoustics and highlights its suitability for the blind estimation of room parameters.

IV-B2 Evaluation of the Room Acoustic Parameters Using the SSIR Model

Table II shows the estimation results obtained for the RAPs derived from the synthesized RIR using the SSIR RIR model. These results indicate the effectiveness of the SSIR for modeling realistic RIRs and subsequently deriving RAPs, highlighting the ability of the SSIR model to capture the essence of real-world RIRs for the precise estimation of RAPs. Specifically, the BERP achieves its best performance with the mel spectrogram featurizer.

TABLE V: Results of an ablation study concerning the efficacy of the ABC. We utilize BERPs with and without the ABC to investigate the efficacy of the ABC, which is used in the orientation module only. The featurization method is MFCC. The ABC significantly improves the task of regressing azimuth and elevation parameters with inherent distribution biases.

	Orientation Module	$\theta$	$\psi$
PCC $\uparrow$	BERP w/o ABC	0.2724	0.6099
PCC $\uparrow$	BERP	0.6173	0.6595
MAE $\downarrow$	BERP w/o ABC	0.2823	0.0574
MAE $\downarrow$	BERP	0.2268	0.0574

IV-B3 Ablation Study

Separate estimation pipelines. To further investigate the efficacy of our specifically designed RFE, PP, and ABC for estimating the RAPs and RPPs, we conducted separate estimation tasks. This study also serves to verify our hypothesis stated in Section III-E, namely, that the unified encoder promotes the effectiveness of the estimation. This ablation study comprises four separate pipelines, namely, the RIR, volume, distance, and orientation modules, each of which is dedicated to mapping the observed speech signals to the corresponding target RAPs and RPPs. The RIR module concurrently estimates the two parameters $T_{i}$ and $T_{d}$ of the SSIR model. The volume and distance modules estimate the room volume $V$ and the sound source distance $D$ , respectively. Finally, the orientation module simultaneously estimates the DOA of the sound source, i.e., $\theta$ and $\psi$ . The results are shown in Table III, showing that the BERP also significantly outperforms the current methods. These results indicate that the proposed framework is effective even in separate estimation. They also substantiate our hypothesis that joint estimation enhances the estimation accuracy via the mutual interdependence of room parameters and facilitates sufficient and efficient learning for the neural networks.

Without the parametric predictor. To dissect the contribution of the PP to the overall performance of the BERP, we conducted an ablation study in which the PP is removed. We employed the separate pipelines for three modules: the RIR, volume, and distance modules. The joint framework is also tested, with the estimation of the DOA discarded since the orientation module integrates the ABC. We consistently used the MFCC featurizer. Table IV compares the results obtained using solely the RFE with those achieved by using the full architecture equipped with the PP. The results show that the PP contributes significantly to the performance of the BERP, especially in the joint estimation.

Without the acoustical bias corrector. To understand the efficacy of the ABC, we conducted an ablation study with or without this bias corrector in terms of estimating the sound source azimuth $\theta$ and elevation $\psi$ . Importantly, the PCC is much more representative than the MAE for evaluating the performance achieved on datasets with biased data distributions. Table V indicates that the ABC significantly mitigates the intrinsic bias within the dataset, proving the efficacy of the ABC for use with substantially biased data distributions.

V Conclusion

We propose the BERP, a universal blind estimation framework designed for simultaneously estimating several RAPs and RPPs, i.e., the speech transmission index (STI), articulation loss of consonants ( $\%\rm{AL}_{\rm{cons}}$ ), reverberation time ( $T_{60}$ ), early decay time (EDT), clarity ( $C_{80}$ and $C_{50}$ ), definition ( $D_{50}$ ), center time ( $T_{s}$ ), room volume ( $V$ ), sound source distance ( $D$ ), DOA of the sound source ( $\theta$ and $\psi$ ), and instantaneous occupancy level ( $N$ ). The BERP provides a new paradigm for blind estimation in room acoustics. This framework can blindly evaluate the RAPs and RPPs simultaneously within a wide range of realistic acoustical environments to parameterize the listener’s local RACs, which makes it promising for a wide variety of applications in room acoustics, hearing aids, communications, and human-machine interactions [14, 16, 17, 21, 22, 24, 19, 25, 15, 5, 23, 13, 75, 26]. We incorporate a new stochastic RIR model, namely, the SSIR model, to realize the concurrent and efficient estimation of RAPs without increasing the computational complexity of the framework. This scheme avoids the use of complicated optimization processes across the significant disparity of the values of the different RAPs. Moreover, the BERP fills a gap in the domain, i.e., the lack of a universal framework for blindly estimating these room parameters from single-channel noisy speech signals, especially for the sound source distance, DOA of the sound source, and instantaneous occupancy level. The evaluation results show that the proposed BERP framework greatly outperforms the current methods and achieves SOTA performance by simultaneously estimating thirteen room-acoustics-related parameters for the first time.

Regarding the limitations of this study and future work, except for occupancy level estimation, the BERP assumes a dynamic-movement, single-source speech signal as the observed input. Future research will aim to address the blind estimation of RAPs and RPPs for multisource speech signals by developing a unified signal model that can accommodate both noisy and crowded reverberant signals in real-world environments. This extension will further expand the applicability of the proposed framework to more complex realistic acoustic scenarios.

VI Acknowledgements

We appreciate the great help from and beneficial discussions with **an Chen for this work.

References

[1] M. Barron, Auditorium Acoustics and Architectural Design (2nd ed.). London: Routledge, 2009.
[2] A. Tsilfidis, I. Mporas, J. Mourjopoulos, and N. Fakotakis, “Automatic speech recognition performance in different room acoustic environments with and without dereverberation preprocessing,” Computer Speech & Language, vol. 27, no. 1, pp. 380–395, 2013. Special issue on Paralinguistics in Naturalistic Speech and Language.
[3] T. Jenrungrot, V. Jayaram, S. Seitz, and I. Kemelmacher-Shlizerman, “The cone of silence: Speech separation by localization,” in Advances in Neural Information Processing Systems, 2020.
[4] S. E. Chazan, H. Hammer, G. Hazan, J. Goldberger, and S. Gannot, “Multi-microphone speaker separation based on deep doa estimation,” 2019 27th European Signal Processing Conference (EUSIPCO), pp. 1–5, 2019.
[5] J.-M. Jot and K. S. Lee, “Augmented reality headphone environment rendering,” in Audio Engineering Society Conference: 2016 AES International Conference on Audio for Virtual and Augmented Reality, Sep 2016.
[6] J. van der Werff and D. de Leeuw, “What you specify is what you get (part 1),” in Audio Engineering Society Convention 114, Mar 2003.
[7] S. V. Amengual Garí, W. Lachenmayr, and E. Mommertz, “Spatial analysis and auralization of room acoustics using a tetrahedral microphone,” The Journal of the Acoustical Society of America, vol. 141, pp. EL369–EL374, 04 2017.
[8] C. Chen, U. Jain, C. Schissler, S. V. A. Gari, Z. Al-Halah, V. K. Ithapu, P. Robinson, and K. Grauman, “Soundspaces: Audio-visual navigation in 3d environments,” in Computer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part VI, (Berlin, Heidelberg), pp. 17–36, Springer-Verlag, 2020.
[9] A. Xenaki, J. Bünsow Boldt, and M. Græsbøll Christensen, “Sound source localization and speech enhancement with sparse Bayesian learning beamforming,” The Journal of the Acoustical Society of America, vol. 143, pp. 3912–3921, 06 2018.
[10] IEC 60268-16:2020, Sound system equipment - part 16: Objective rating of speech intelligibility by speech transmission index. 2020.
[11] V. M. A. Peutz and W. Klein, “Articulation loss of consonants influenced by noise, reverberation and echo,” (in Dutch), Acoust. Soc. Netherlands, vol. 28, pp. 11–18.
[12] ISO 3382:2009, Acoustics - measurements of room acoustics parameters - part 1: Performance spaces. 2009.
[13] K. Kinoshita, M. Delcroix, T. Yoshioka, T. Nakatani, E. Habets, R. Haeb-Umbach, V. Leutnant, A. Sehr, W. Kellermann, R. Maas, S. Gannot, and B. Raj, “The reverb challenge: A common evaluation framework for dereverberation and recognition of reverberant speech,” in 2013 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, pp. 1–4, 2013.
[14] L. Frenkel, S. E. Chazan, and J. Goldberger, “Domain adaptation using suitable pseudo labels for speech enhancement and dereverberation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 32, pp. 1226–1236, 2024.
[15] H. Morgenstern and B. Rafaely, “Spatial reverberation and dereverberation using an acoustic multiple-input multiple-output system,” Journal of the Audio Engineering Society, vol. 65, pp. 42–55, Feb. 2017.
[16] T. Gajecki and W. Nogueira, “A fused deep denoising sound coding strategy for bilateral cochlear implants,” IEEE Transactions on Biomedical Engineering, pp. 1–11, 2024.
[17] E. P. Reynders, J. Van den Wyngaert, M. Verlinden, and G. Vermeir, “Development and performance assessment of sound absorbing chandeliers for reverberation control and improved verbal communication in large rooms,” Applied Acoustics, vol. 218, p. 109874, 2024.
[18] D. Fogerty, A. Alghamdi, and W.-Y. Chan, “The effect of simulated room acoustic parameters on the intelligibility and perceived reverberation of monosyllabic words and sentences,” The Journal of the Acoustical Society of America, vol. 147, pp. EL396–EL402, 05 2020.
[19] B. Eurich, T. Klenzner, and M. Oehler, “Impact of room acoustic parameters on speech and music perception among participants with cochlear implants,” Hearing Research, vol. 377, pp. 122–132, 2019.
[20] H.-Y. Lee, J.-W. Cho, M. Kim, and H.-M. Park, “Dnn-based feature enhancement using doa-constrained ica for robust speech recognition,” IEEE Signal Processing Letters, vol. 23, no. 8, pp. 1091–1095, 2016.
[21] G. Yenduri, R. M, P. K. R. Maddikunta, T. R. Gadekallu, R. H. Jhaveri, A. Bandi, J. Chen, W. Wang, A. A. Shirawalmath, R. Ravishankar, and W. Wang, “Spatial computing: Concept, applications, challenges and future directions,” 2024.
[22] H. M. Kamdjou, D. Baudry, V. Havard, and S. Ouchani, “Resource-constrained extended reality operated with digital twin in industrial internet of things,” IEEE Open Journal of the Communications Society, vol. 5, pp. 928–950, 2024.
[23] J. Nikunen and T. Virtanen, “Direction of arrival based spatial covariance model for blind sound source separation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 3, pp. 727–739, 2014.
[24] A. Taghipour, S. Athari, A. Gisladottir, T. Sievers, and K. Eggenschwiler, “Room acoustical parameters as predictors of acoustic comfort in outdoor spaces of housing complexes,” Frontiers in Psychology, vol. 11, p. 344, 03 2020.
[25] H. Dong and C. Lee, “Speech intelligibility improvement in noisy reverberant environments based on speech enhancement and inverse filtering,” EURASIP Journal on Audio, Speech, and Music Processing, vol. 3, 2018.
[26] X. Li, L. Girin, F. Badeig, and R. Horaud, “Reverberant sound localization with a robot head based on direct-path relative transfer function,” in 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 2819–2826, IEEE Press, 2016.
[27] H. Kuttruff, Room Acoustics. Taylor & Francis, 2016.
[28] L. Wang, S. Duangpummet, and M. Unoki, “Blind estimation of speech transmission index and room acoustic parameters by using extended model of room impulse response derived from speech signals,” IEEE Access, vol. 11, pp. 49431–49444, 2023.
[29] S. S. Kushwaha, I. R. Roman, M. Fuentes, and J. P. Bello, “Sound source distance estimation in diverse and dynamic acoustic conditions,” in 2023 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 1–5, 2023.
[30] P.-A. Grumiaux, S. Kitić, L. Girin, and A. Guérin, “A survey of sound source localization with deep learning methods,” The Journal of the Acoustical Society of America, vol. 152, pp. 107–151, 07 2022.
[31] C. Molnar and T. Freiesleben, Supervised Machine Learning For Science. 2024.
[32] C. Ick, A. Mehrabi, and W. **, “Blind acoustic room parameter estimation using phase features,” in ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1–5, 2023.
[33] A. F. Genovese, H. Gamper, V. Pulkki, N. Raghuvanshi, and I. J. Tashev, “Blind room volume estimation from single-channel noisy speech,” in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 231–235, 2019.
[34] H. Gamper and I. J. Tashev, “Blind reverberation time estimation using a convolutional neural network,” in 2018 16th International Workshop on Acoustic Signal Enhancement (IWAENC), pp. 136–140, 2018.
[35] P. S. López, P. Callens, and M. Cernak, “A universal deep room acoustics estimator,” in 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 356–360, 2021.
[36] P. Callens and M. Cernak, “Joint blind room acoustic characterization from speech and music signals using convolutional recurrent neural networks,” 2020.
[37] J. Eaton, N. Gaubitch, A. Moore, and P. Naylor, “Estimation of room acoustic parameters: The ace challenge,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, pp. 1–1, 06 2016.
[38] K. Zheng, C. Zheng, J. Sang, Y. Zhang, and X. Li, “Noise-robust blind reverberation time estimation using noise-aware time–frequency masking,” Measurement, vol. 192, p. 110901, 2022.
[39] S. Duangpummet, J. Karnjana, W. Kongprawechnon, and M. Unoki, “Blind estimation of speech transmission index and room acoustic parameters based on the extended model of room impulse response,” Applied Acoustics, vol. 185, p. 108372, 2022.
[40] J. Traer and J. H. McDermott, “Statistics of natural reverberation enable perceptual separation of sound and space,” Proceedings of the National Academy of Sciences, vol. 113, no. 48, pp. E7856–E7865, 2016.
[41] C. Christensen, G. Koutsouris, and J. Rindel, “The iso 3382 parameters: Can we simulate them? can we measure them?,” vol. 20, 06 2013.
[42] R. Kliper, H. Kayser, D. Weinshall, I. Nelken, and J. Anemüller, “Monaural azimuth localization using spectral dynamics of speech,” in Proc. Interspeech 2011, pp. 33–36, 2011.
[43] R. Takashima, T. Takiguchi, and Y. Ariki, “Single-channel multi-talker-localization based on maximum likelihood,” in 2009 IEEE/SP 15th Workshop on Statistical Signal Processing, pp. 461–464, 2009.
[44] F. Toole, Sound Reproduction: The Acoustics and Psychoacoustics of Loudspeakers and Rooms. Audio Engineering Society Presents, Taylor & Francis, 2017.
[45] S. Cerdá, A. Giménez, J. Romero, R. Cibrián, and J. Miralles, “Room acoustical parameters: A factor analysis approach,” Applied Acoustics, vol. 70, no. 1, pp. 97–109, 2009.
[46] M. Queiroz, F. Iazzetta, F. Kon, M. H. A. Gomes, F. L. Figueiredo, B. Masiero, L. K. Ueda, L. Dias, M. H. C. Torres, and L. F. Thomaz, “Acmus: An open, integrated platform for room acoustics research,” Journal of the Brazilian Computer Society, 2013.
[47] T. Houtgast and H. J. M. Steeneken, “The modulation transfer function in room acoustics as a predictor of speech intelligibility,” The Journal of the Acoustical Society of America, vol. 54, no. 2, pp. 557–557, 1973.
[48] H. J. M. Steeneken and T. Houtgast, “A physical method for measuring speech‐transmission quality,” The Journal of the Acoustical Society of America, vol. 67, no. 1, pp. 318–326, 1980.
[49] T. Houtgast and H. J. M. Steeneken, “A review of the MTF concept in room acoustics and its use for estimating speech intelligibility in auditoria,” The Journal of the Acoustical Society of America, vol. 77, pp. 1069–1077, 03 1985.
[50] O. Shih and A. Rowe, “Occupancy estimation using ultrasonic chirps,” in Proceedings of the ACM/IEEE Sixth International Conference on Cyber-Physical Systems, ICCPS ’15, (New York, NY, USA), pp. 149–158, Association for Computing Machinery, 2015.
[51] H. Qian, G. Zhenhao, and L. Chao, “Occupancy estimation in smart buildings using audio-processing techniques,” in International Conference on Computing in Civil and Building Engineering (ICCCBE) 2016, 2016 Fall.
[52] A. Ebadat, G. Bottegal, D. Varagnolo, B. Wahlberg, H. Hjalmarsson, and K. H. Johansson, “Blind identification strategies for room occupancy estimation,” in 2015 European Control Conference (ECC), pp. 1315–1320, 2015.
[53] M. West, “The sound attenuation in an open-plan office,” Applied Acoustics, vol. 6, no. 1, pp. 35–56, 1973.
[54] P. Somervuo, P. Lauha, and T. Lokki, “Effects of landscape and distance in automatic audio based bird species identification,” The Journal of the Acoustical Society of America, vol. 154, pp. 245–254, 07 2023.
[55] R. Badeau, “Common mathematical framework for stochastic reverberation models,” The Journal of the Acoustical Society of America, vol. 145, pp. 2733–2745, 04 2019.
[56] M. R. Schroeder and K. H. Kuttruff, “On Frequency Response Curves in Rooms. Comparison of Experimental, Theoretical, and Monte Carlo Results for the Average Frequency Spacing between Maxima,” The Journal of the Acoustical Society of America, vol. 34, pp. 76–80, 01 1962.
[57] J.-D. Polack, “Playing billiards in the concert hall: The mathematical foundations of geometrical room acoustics,” Applied Acoustics, vol. 38, no. 2, pp. 235–244, 1993.
[58] M. R. Schroeder, “Modulation transfer functions: Definition and measurement,” Acta Acustica united with Acustica, vol. 49, no. 3, pp. 179–182, 1981.
[59] K. Prawda, S. J. Schlecht, and V. Välimäki, “Calibrating the Sabine and Eyring formulas,” The Journal of the Acoustical Society of America, vol. 152, pp. 1158–1169, 08 2022.
[60] G. Götz, S. J. Schlecht, and V. Pulkki, “A dataset of higher-order ambisonic room impulse responses and 3d models measured in a room with varying furniture,” in 2021 Immersive and 3D Audio: from Architecture to Automotive (I3DA), pp. 1–8, 2021.
[61] I. Szöke, M. Skácel, L. Mošner, J. Paliesek, and J. Černocký, “Building and evaluation of a real room impulse response dataset,” IEEE Journal of Selected Topics in Signal Processing, vol. 13, no. 4, pp. 863–876, 2019.
[62] D. T. Murphy and S. Shelley, “Openair: An interactive auralization web resource and database,” in Audio Engineering Society Convention 129, Nov 2010.
[63] J. Thiemann, N. Ito, and E. Vincent, “The Diverse Environments Multi-channel Acoustic Noise Database (DEMAND): A database of multichannel environmental noise recordings,” Proceedings of Meetings on Acoustics, vol. 19, p. 035081, 05 2013.
[64] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: An asr corpus based on public domain audio books,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210, 2015.
[65] P. Srivastava, A. Deleforge, and E. Vincent, “Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators,” in 2022 International Workshop on Acoustic Signal Enhancement (IWAENC), pp. 1–5, 2022.
[66] A. Gulati, C.-C. Chiu, J. Qin, J. Yu, N. Parmar, R. Pang, S. Wang, W. Han, Y. Wu, Y. Zhang, and Z. Zhang, “Conformer: Convolution-augmented transformer for speech recognition,” 2020.
[67] P. Ramachandran, B. Zoph, and Q. V. Le, “Searching for activation functions,” in 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Workshop Track Proceedings, OpenReview.net, 2018.
[68] Y. Sun, L. Dong, B. Patra, S. Ma, S. Huang, A. Benhaim, V. Chaudhary, X. Song, and F. Wei, “A length-extrapolatable transformer,” in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (A. Rogers, J. Boyd-Graber, and N. Okazaki, eds.), (Toronto, Canada), pp. 14590–14604, Association for Computational Linguistics, 2023.
[69] Y. N. Dauphin, A. Fan, M. Auli, and D. Grangier, “Language modeling with gated convolutional networks,” in Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML’17, pp. 933–941, JMLR.org, 2017.
[70] J. Su, M. Ahmed, Y. Lu, S. Pan, W. Bo, and Y. Liu, “Roformer: Enhanced transformer with rotary position embedding,” Neurocomputing, vol. 568, p. 127063, 2024.
[71] D. Hendrycks and K. Gimpel, “Gaussian error linear units (gelus),” 2023.
[72] P. J. Huber, “A robust version of the probability ratio test,” Annals of Mathematical Statistics, vol. 36, pp. 1753–1758, 1965.
[73] L. Ciampiconi, A. Elwood, M. Leonardi, A. Mohamed, and A. Rozza, “A survey and taxonomy of loss functions in machine learning,” 2023.
[74] L. Liu, H. Jiang, P. He, W. Chen, X. Liu, J. Gao, and J. Han, “On the variance of the adaptive learning rate and beyond,” 2021.
[75] I.-J. Jung and J.-G. Ih, “Distance estimation of a sound source using the multiple intensity vectors,” The Journal of the Acoustical Society of America, vol. 148, pp. EL105–EL111, 07 2020.