
RealMAN: A Real-Recorded and Annotated Microphone Array Dataset for Dynamic Speech Enhancement and Localization

Bing Yang1,3, Changsheng Quan1, Yabo Wang1, Pengyu Wang1, Yujie Yang1,
Ying Fang1, Nian Shao1, Hui Bu2, Xin Xu2, Xiaofei Li1,3

1School of Engineering, Westlake University
2Beijing AIShell Technology Co. Ltd
3Institute of Advanced Technology, Westlake Institute for Advanced Study
{yangbing, quanchangsheng, wangyabo, wangpengyu, yangyujie,
fangying, shaonian, lixiaofei}@westlake.edu.cn

[email protected], [email protected]
Corresponding author.
Abstract

The training of deep learning-based multichannel speech enhancement and source localization systems relies heavily on the simulation of room impulse responses and multichannel diffuse noise, due to the lack of large-scale real-recorded datasets. However, the acoustic mismatch between simulated and real-world data could degrade model performance when the models are applied in real-world scenarios. To bridge this simulation-to-real gap, this paper presents a new, relatively large-scale Real-recorded and annotated Microphone Array speech&Noise (RealMAN) dataset (publicly available at https://github.com/Audio-WestlakeU/RealMAN). The proposed dataset is valuable in two aspects: 1) benchmarking speech enhancement and localization algorithms in real scenarios; 2) offering a substantial amount of real-world training data for potentially improving the performance of real-world applications. Specifically, a 32-channel array with high-fidelity microphones is used for recording. A loudspeaker is used for playing source speech signals. A total of 83 hours of speech signals (48 hours for static speakers and 35 hours for moving speakers) are recorded in 32 different scenes, and 144 hours of background noise are recorded in 31 different scenes. Both speech and noise recording scenes cover various common indoor, outdoor, semi-outdoor and transportation environments, which enables the training of general-purpose speech enhancement and source localization networks. To obtain the task-specific annotations, the azimuth angle of the loudspeaker is annotated with an omnidirectional fisheye camera by automatically detecting the loudspeaker. The direct-path signal is set as the target clean speech for speech enhancement, which is obtained by filtering the source speech signal with an estimated direct-path propagation filter.
Baseline experiments demonstrate that i) compared to using simulated data, the proposed dataset is indeed able to train better speech enhancement and source localization networks; ii) using various sub-arrays of the proposed 32-channel microphone array can successfully train variable-array networks that can be directly applied to unseen arrays.

1 Introduction

Microphone array-based multichannel speech enhancement and source localization are two important front-end audio signal processing tasks [1, 2, 3]. Currently, most existing works are deep learning-based and data-driven, for which the amount and diversity of microphone array data are crucial for the proper training of neural networks. Although an unlimited amount of microphone array data can be simulated using a room impulse response (RIR) simulator (with the image method [4]) and a multichannel diffuse noise generator [5], there are still significant mismatches between the acoustic properties of simulated and real-world microphone array data, including: 1) The geometry of the room and the presence of obstacles within it. The image method [4] simulates empty box-shaped rooms, which does not accurately reflect the acoustics of real-world rooms with irregular shapes and built-in obstacles. 2) The directivity of the sound sources and microphones, as well as the wall absorption coefficients [6]. 3) Piece-wise simulation of moving sound sources [7]. To simulate microphone signals for a moving source, the continuous trajectory of the sound source is discretized into multiple locations. One signal segment is simulated for each location, and the signal segments are then concatenated. If the sound source moves too fast or the discretization is too sparse, the resulting microphone signals may contain clicking noises and other audible artifacts [7]. 4) The spatial correlation of simulated multichannel noise is normally determined under the hypothesis of a theoretical diffuse noise field [5], while that of real-world noise could deviate from the theoretical values by a large margin, as shown in Fig. 5. Relevant studies have shown that models trained on simulated data often perform poorly in real-world scenarios [6, 8, 9].

Training with real-world data can avoid these mismatches. However, existing real-world microphone array data with annotated target clean speech and source location information is limited and lacks diversity. To address this, we collect a new Real-recorded and annotated Microphone Array speech&Noise (RealMAN) dataset from a variety of real-world indoor, outdoor, semi-outdoor, and transportation scenes. These recordings encompass a diverse range of spatial/room acoustics and noise characteristics. The dataset consists of 83 hours of speech and 144 hours of noise, recorded in 32 and 31 different scenes respectively, with both speech and noise recorded in 17 of these scenes. Both static and moving speech sources are included. We provide annotations of source azimuth angle, direct-path target clean speech, and speech transcription, for speech enhancement (and evaluation of automatic speech recognition) and source localization. A 32-channel microphone array is used for recording. End-to-end speech enhancement and source localization models are normally array-dependent, which means that a network trained with one specific array can only be used with the same array. However, collecting real-world data for a new array is cumbersome. One solution is to use data from various arrays to train a variable-array network that can generalize to unseen arrays [10, 11, 12]. Our 32-channel array can provide many different sub-arrays for training such variable-array networks. Baseline experiments have been conducted on the proposed dataset, demonstrating that 1) compared with simulated data, training with the real-world data eliminates the simulation-to-real problem and achieves better performance in speech enhancement and source localization, so the proposed dataset is more suitable for benchmarking new algorithms and reflecting their capabilities; 2) the variable-array networks [10, 12] can be successfully trained with our 32-channel array dataset.
Hopefully, these networks can be applied directly to real applications involving unseen arrays.

2 Related Work

It is challenging to collect a large-scale real-recorded and annotated microphone array dataset. Tables 1 and 2 summarize the existing multi-channel speech and noise datasets, respectively.

Multi-channel speech recording and annotation. In MIR [13], BUTReverb [14], Reverb [15], DCASE [16], ACE [17] and dEchorate [18], real-world RIRs are measured instead of directly collecting speech recordings. Measuring real-world RIRs offers several advantages: 1) speech recordings can be generated by convolving RIRs with source signals, which provides sufficient data realness; 2) The direct-path signal can be easily obtained by convolving the direct-path impulse response (extracted from RIRs) with source signals, which can be used as the training target signal for speech enhancement; 3) information for source localization, such as the time-difference of arrival, can also be obtained from the multi-channel direct-path impulse responses. However, one significant drawback of RIR measurement is that it is more time-consuming than speech recording. Consequently, existing datasets such as MIR, BUTReverb, Reverb, ACE, and dEchorate offer only a limited range of measured RIRs and scenes. The scenario becomes even more complex with a moving source, as seen in DCASE, which requires measuring RIRs at multiple discrete locations along a single trajectory, making the process substantially more time-consuming.

Other datasets directly provide multi-channel speech recordings. However, annotating the direct-path speech (as the training target signal) for speech enhancement poses challenges. Although some datasets provide close-talking signals, these cannot serve as the target signals since the direct-path speech is essentially an energy-attenuated and time-shifted version of the close-talking signals. To obtain a clean target signal, the CHiME-3 dataset [19] simulates the time delay of the direct-path speech for training, and also provides speech signals recorded in a booth for development and test. In addition to evaluating the speech quality, speech enhancement can also be evaluated in terms of automatic speech recognition (ASR) performance. LibriCSS [20], MC-WSJ-AV [21], CHiME-5/-6/-7 [22], AMIMeeting [23], AISHELL-4 [24] and AliMeeting [25] (see Appendix for details) provide speech transcriptions for evaluating the ASR performance. Due to the lack of a target signal, speech enhancement networks for these datasets can only be trained with simulated data.

DCASE [16] and LOCATA [26] provide the source location. In DCASE, the RIR needs to be finely measured at many pre-defined discrete locations, which is very time-consuming. In LOCATA, source location is obtained through an optical tracking system installed in the computing laboratory. However, setting up the optical tracking system for multiple rooms incurs significant costs.

Table 1: Existing microphone array speech datasets with speech enhancement and/or source localization annotations.
| Dataset | # Scenes | Scene type | Source state | # RIR / speech duration | Main Data | Microphone Array (×1 by default) |
|---|---|---|---|---|---|---|
| MIR [13] | 3 | Lab | Static | 78 | RIR | 8-ch linear (×3) |
| BUTReverb [14] | 9 | - | Static | 51 | RIR | 8-ch spherical |
| Reverb [15] | 3 | - | Static | 24 | RIR | 8-ch circular |
| DCASE [16] | 9 | Campus | Static | 38530 | RIR, Location | 32-ch spherical |
| ACE [17] | 7 | Campus | Static | 14 | RIR, RoomInfo | 2-ch, 3-ch triangle, 8-ch linear, 5-ch cruciform, 32-ch spherical |
| dEchorate [18] | 11 | Lab | Static | 99 | RIR, RoomInfo | 5-ch linear (×6) |
| CHiME-3/-4 [19] | 5 | Multiple | - | 9.9 h | Recording, Transcription | 6-ch rectangular |
| LOCATA [26] | 1 | Lab | Static+moving | 0.9 h | Recording, Location | 15-ch planar, 32-ch spherical, 12-ch robot, 4-ch hearing aids |
| RealMAN (prop.) | 32 | Multiple | Static+moving | 83 h | Recording, Transcription, Direct-path signal, Location | 32-ch (include various sub-arrays) |

The columns "# Scenes", "Scene type", "Source state" and "# RIR / speech duration" are grouped under "Diversity, Quantity".
  • # RIR involves room conditions, source positions and array positions

  • RoomInfo. denotes room acoustic information like reverberation time T60, direct-to-reverberant ratio (DRR), etc.

  • Only compact arrays are considered, and single microphone and close-talking (lapel and headset microphones) are excluded

Table 2: Existing microphone array noise datasets.
| Dataset | Noise Scenes / Types | Duration |
|---|---|---|
| BUTReverb [14] | 9 rooms (large, middle and small size), with room environmental noise (silence) | 4.7 h |
| Reverb [15] | 3 rooms (large, medium and small), with stationary background noise mainly caused by air conditioning | 0.8 h |
| ACE [17] | 7 offices, with meeting and teaching rooms, babble, ambient and fan noise recorded in each room | 13.6 h |
| dEchorate [18] | 1 room with 11 surface absorptions, with diffuse babble, white and silence noises | 0.6 h |
| CHiME-3/-4 [19] | 4 public scenarios (cafe, street junction, public transport and pedestrian area) | 8.4 h |
| DCASE [16] | 9 rooms in university buildings, with ambient noise | 3.9 h |
| DEMAND [27] | 18 scenes (domestic, office, public, transportation, nature categories) | 1.5 h |
| RealMAN (prop.) | 31 scenes (indoor, semi-outdoor, outdoor, transportation categories) | 144 h |
  • DEMAND uses a 16-ch array in 4 staggered rows. The microphone array configurations for these datasets are listed in Table 1.

Multi-channel noise recording. Noise recordings are typically provided either separately [27], or together with RIRs [18, 14, 15, 17, 16] or speech recordings [19]. However, most of these datasets [18, 14, 15, 17, 19, 16, 27] are limited in both quantity and diversity. Although the DEMAND dataset [27] offers noise signals recorded in various scenes with a 16-channel microphone array, the duration of its recordings is quite short.

By comparison, the proposed RealMAN dataset has the following advantages. 1) Realness. Speech and noise are recorded in real environments. Direct recording of moving sources avoids the issues associated with the piece-wise generation method. Different individuals move the loudspeaker freely to closely mimic a human walking in real applications. One unrealistic factor is that a loudspeaker plays back the speech signals, rather than real human speakers talking. 2) Quantity and diversity. We record both speech signals and noise signals across various scenes. Compared with existing datasets, our collection offers greater diversity in spatial acoustics (in terms of acoustic scenes, source positions and states, etc.) and noise types. This enables effective training of speech enhancement and source localization networks. 3) Annotation. We provide detailed annotations for source locations, direct-path speech, and speech transcriptions, which are essential for accurate training and evaluation. 4) Number of channels. The number of microphone channels, i.e. 32, is higher than in almost all existing datasets, which facilitates the training of variable-array networks. 5) Recording cost. The recording, playback, and camera devices are portable and easily transportable to different scenes. The camera shares a common coordinate center with the microphone array, simplifying the annotation process of source locations.

3 RealMAN Dataset

3.1 Recording system

Figure 1: Recording devices. (a) Recording devices. (b) The geometry of the 32-channel microphone array.

Fig. 1(a) shows the recording system used in this work, which mainly consists of a 32-channel microphone array, a high-fidelity monophonic loudspeaker and a 360-degree fisheye camera.

32-channel microphone array: The array comprises 32 high-fidelity Audio-Technica BP899 microphones. The array geometry is shown in Fig. 1(b). This array encompasses the array topologies found in common use cases, including planar linear arrays, circular arrays, and 3D arrays. The sampling rate of the microphone recordings is 48 kHz. The audio signals are digitized by 4 clock-synchronized 8-channel microphone pre-amplifiers (RME OctoMic II) and processed by a laptop through an audio interface (Digiface USB).

360-degree fisheye camera: A 360-degree fisheye camera (HIKVISION DS-2CD63C5F-IHV) is placed right above the microphone array. The camera records the 360-degree panoramic image in real time, synchronized with the microphone recording. The frame interval of the fisheye camera is 100 ms (i.e., 10 frames per second).

High-fidelity monophonic speaker: A high-fidelity monophonic loudspeaker (FOSTEX 6301 NE) is used to play source speech signals. It is placed on a height-adjustable and mobile carrier such that one can control the position of the loudspeaker to mimic a standing/moving human speaker. A 5-cm diameter LED light is put on top of the loudspeaker to improve the visibility of the loudspeaker to the fisheye camera and to annotate the position of the loudspeaker. The LED light can emit red or green light, which is visible to the fisheye camera under various lighting conditions.

3.2 Source speech signals

The source speech signals played by the loudspeaker contain nearly 35 hours of clean Mandarin speech, of which about 30 hours are free talking and 5 hours are reading. For free talking, speakers are encouraged to speak freely on their own. Reading speech entails speakers reading news articles. The topics of the speech content span a wide range of domains including news reports, games, reading experiences, and life trivia. There are 55 speakers in total, including 27 males and 28 females. 17 speakers are recorded in a studio (T60 is smaller than 150 ms) with a high-fidelity microphone, while the rest are recorded in a living room with a slightly larger reverberation time (T60 is about 200 ms) using a lower-fidelity microphone. The distance between speaker and microphone is about 0.2 m.

3.3 Speech and noise recording process

Speech. The objective of the speech recording process is to mirror real-life scenarios of human activities. In each scene, the positions of both the camera and the microphone array are fixed. When playing source speech, the loudspeaker is either static or moving. For the moving case, one person manually moves the loudspeaker carrier at a varying but reasonable speed. In transportation scenarios, people typically remain stationary, so the loudspeaker is only recorded in the static state. The height of the microphone array is set to 1.40 m. The center height of the loudspeaker is aligned with the mouth height of a standing person, varying randomly between 1.30 m and 1.60 m. Most of the time, the loudspeaker faces towards the microphone array. Speech data are recorded across 32 distinct environments, including indoor, outdoor, semi-outdoor, and transportation scenarios. The detailed scene information of the proposed dataset is shown in Appendix B.1. We ensure that most speech recordings are conducted under quiet conditions (usually at midnight), with background noise levels maintained below 40 dB.

Noise. Noise recording is simpler: we place the microphone array in various environments to capture the real-world ambient noise. Noise recording is normally conducted in the daytime with active events in each environment. Recorded clips with noise power lower than a certain threshold are discarded. Then, voice activity detection is applied to further filter out clips that include prominent speech signals. Noise is recorded in 31 different scenes, and 144.45 hours of recordings are ultimately retained, covering most everyday scenarios.

3.4 Data annotation

Direct-path target clean speech. Deep learning-based speech enhancement (dereverberation and denoising) methods require a target clean signal for training. Normally, the direct-path speech signal is used [28, 29]. For simulated datasets, the direct-path speech can be directly generated using the simulated direct-path impulse response. For real-recorded datasets, providing real direct-path speech has never been solved in the field. Instead of providing a target signal that can be used for training, existing real-recorded datasets either provide speech transcriptions or close-talk speech signals [19, 20, 21, 23].

In this paper, we develop a method to estimate the direct-path speech from the source speech (replayed speech) and the microphone recordings. The recording process is formulated in the time domain as $x(t) = s_{dp}(t) + s_{rev}(t) + n(t)$ with $s_{dp}(t) = h_{dp}(t) * h_{dev}(t) * s(t) = h_{dev} * [A s(t-\tau)]$, where $x(t)$, $s_{dp}(t)$, $s_{rev}(t)$ and $n(t)$ are the microphone recording, direct-path speech, speech reverberation, and noise, respectively.
Theoretically, the direct-path speech is the convolution of the played source speech $s(t)$, the impulse response of the playing and recording devices $h_{dev}(t)$, and the direct-path impulse response $h_{dp}(t)$, where the direct-path impulse response $h_{dp}(t)$ can be formulated as a level attenuation $A$ and a time shift $\tau$ of $s(t)$. Note that $A$ and $\tau$ are time-invariant for a static source, while time-varying for a moving source. The impulse response of the devices $h_{dev}$ can be measured in advance. Then, the estimation of $s_{dp}(t)$ amounts to the estimation of $A$ and $\tau$ according to the known $x(t)$, $h_{dev}$, and $s(t)$. For details of the algorithm, readers can refer to Appendix C.1.
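To make the signal model concrete, the following is a minimal sketch of how $A$ and $\tau$ could be estimated for a static source. It is not the algorithm of Appendix C.1: it assumes a single channel, an integer-sample delay, and that the source speech has already been filtered by the measured device response. The function name `estimate_direct_path` and the parameter `max_delay` are hypothetical.

```python
import numpy as np

def estimate_direct_path(x, s_dev, max_delay):
    """Estimate gain A and integer delay tau so that A * s_dev[t - tau]
    approximates the direct-path part of the recording x.

    x         : microphone recording (1-D array)
    s_dev     : source speech pre-filtered by the device response h_dev (assumption)
    max_delay : largest delay, in samples, included in the search (assumption)
    """
    n = min(len(x), len(s_dev))
    x, s_dev = x[:n], s_dev[:n]
    # Cross-correlation between recording and delayed source for lags 0..max_delay.
    corr = np.array([np.dot(x[lag:], s_dev[:n - lag])
                     for lag in range(max_delay + 1)])
    # The delay is taken at the strongest correlation peak.
    tau = int(np.argmax(np.abs(corr)))
    # Least-squares gain for the selected delay.
    ref = s_dev[:n - tau]
    A = corr[tau] / np.dot(ref, ref)
    # Reconstructed direct-path signal A * s_dev(t - tau).
    s_dp = np.zeros(n)
    s_dp[tau:] = A * ref
    return A, tau, s_dp
```

For a moving source, where $A$ and $\tau$ vary over time, one option would be to apply such an estimate on short overlapping segments, but that is beyond this sketch.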

In informal listening tests, the estimated direct-path speech has the same perceptual quality as the source speech. In addition, to further evaluate the potential impact of the proposed method, we test the ASR performance with an established ASR model trained on WenetSpeech [30], a Mandarin dataset of over 10,000 hours, using the ESPNet toolkit. The character error rates (CERs) for the source speech and the estimated direct-path speech are identical, which means that the direct-path filtering of the source speech does not degrade speech intelligibility. Overall, the estimated direct-path speech could be a good target signal for training speech enhancement networks. The intermediate results shown in Appendix C.1 also support this claim.

Sound source location. The annotation of the source (loudspeaker) location leverages a fisheye camera and an LED light (placed on top of the loudspeaker). During the speech recording process, the fisheye camera is placed right above the microphone array, and the plane coordinates of the camera and the microphone array are aligned. Accordingly, the azimuth angle of the source can be calculated from the position of the LED light in the camera image. To locate the LED light in the image, a vision-based LED light detection algorithm is designed and applied to each recorded video frame. To guarantee the accuracy, all the detection results are manually checked. Please refer to Appendix C.3 for illustrative examples and pseudocode of the LED detection algorithm. According to the frame interval of the fisheye camera, the temporal resolution of the source location annotation is set to 100 ms.
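As an illustration of the last step, the sketch below converts a detected LED pixel position into an azimuth angle. It assumes that the optical axis passes through the array center, that the fisheye projection is radially symmetric (so azimuth is preserved in the image plane), and that a calibrated image center is available. The function name, the parameters `cx`, `cy` and `offset_deg`, and the angle convention are hypothetical and not taken from the dataset tools.

```python
import math

def led_pixel_to_azimuth(px, py, cx, cy, offset_deg=0.0):
    """Map an LED pixel position (px, py) to an azimuth angle in [0, 360).

    cx, cy     : image center in pixels, e.g. from camera calibration (assumption)
    offset_deg : rotation aligning the image x-axis with the array's
                 zero-azimuth direction (assumption)
    """
    # Image rows grow downward, so the y-axis is flipped to obtain a
    # counter-clockwise angle measured from the image x-axis.
    angle = math.degrees(math.atan2(cy - py, px - cx))
    return (angle + offset_deg) % 360.0
```

With one detection per video frame, this would yield one azimuth label every 100 ms.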

3.5 Dataset split and statistics

Table 3: Statistics of the training, validation, and test sets of RealMAN.
| | Training | Validation | Test |
|---|---|---|---|
| Speech duration (h) | 63.50 | 7.78 | 11.64 |
| - Moving speaker (h) | 26.57 | 3.27 | 4.75 |
| - Static speaker (h) | 36.93 | 4.51 | 6.89 |
| Noise duration (h) | 106.32 | 15.95 | 22.18 |
| Number of scenes | 40 | 17 | 21 |
| Number of female speakers | 23 | 4 | 2 |
| Number of male speakers | 20 | 2 | 4 |

The recorded speech and noise data are split into training, validation, and test sets for the deep learning-based speech enhancement and source localization methods, according to the acoustic characteristics of the recording scenes and speaker identities:

Acoustic characteristics of scenes. Different scenes have different RIRs and noise characteristics. To make sure that the model can be trained under diverse scenes, 40 different scenes (of speech and noise) are included in the training set. Various types of acoustic scenes are also provided in the validation and test sets (17 and 21, respectively), such that the algorithms can be fully evaluated under various scenarios. There are 3 scenes that appear only in the validation set and 3 that appear only in the test set, to further evaluate the generalization capability on unseen acoustic scenes. Note that, although some scenes may overlap across sets, there is no data sample overlap among them.

Speaker identities. All 55 speakers are split into 43, 6, and 6 for the training, validation and test sets, respectively. Following the general practice for speech corpus splits, no speaker appears in more than one set.

Speech and noise matching. During recording, speech and noise are normally recorded at quiet midnight and in the noisy daytime, respectively. To make the noisy speech as real as possible, it is better to mix speech and noise from the same scene. This principle is followed for the validation and test sets. The number of scenes with both speech and noise recordings is 10 out of 17 for validation scenes and 11 out of 21 for test scenes. Indoor environments of the same type, such as living rooms or office rooms, have similar noise characteristics, hence noise is not recorded for every scene. For these cases, we mix the speech of one scene with noise from a similar environment. Specifically, speech of all ClassRooms is mixed with ClassRoom1 noise; speech of all OfficeRooms and Library is mixed with the noise of OfficeRoom1/3; speech of all LivingRooms is mixed with the noise of LivingRoom1 and Laundry. As for training, some preliminary experiments show that such matching of speech and noise scenes is not required, and that a random scene match between speech and noise is more suitable for network training, possibly because it increases data diversity.

The statistics of the dataset are shown in Table 3. The recorded speech, about 83 hours in total, is divided into 63.5, 7.8 and 11.6 hours for training, validation and test, respectively. The 144.5 hours of noise data are divided into 106.3, 16.0 and 22.2 hours for training, validation and test, respectively.

4 Baseline Experiments

In this section, we benchmark the proposed dataset for speech enhancement and source localization. As presented in Section 3.5, the training set is generated by randomly mixing speech and noise, with a signal-to-noise ratio (SNR) uniformly distributed in [0, 15] dB. By contrast, the validation and test sets are generated by mixing speech and noise from matched scenes, and the signal levels of the mixed speech and noise are kept unchanged as recorded to maintain their natural loudness.
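The training-set mixing rule can be sketched as follows. This is only an illustration under stated assumptions: the SNR is computed on a single reference channel from the full recorded speech and noise signals, and the names `mix_at_snr` and `ref_channel` are hypothetical. The exact SNR definition used to build the dataset may differ.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db, ref_channel=0):
    """Scale multichannel noise so that the mixture has the requested SNR.

    speech, noise : arrays of shape (channels, samples) with equal shapes
    snr_db        : target SNR in dB, measured on ref_channel (assumption)
    """
    speech_power = np.mean(speech[ref_channel] ** 2)
    noise_power = np.mean(noise[ref_channel] ** 2)
    # One common gain for all channels keeps the spatial properties of the noise.
    gain = np.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return speech + gain * noise, gain

def sample_training_snr(rng):
    """Draw an SNR uniformly from [0, 15] dB, as described for the training set."""
    return rng.uniform(0.0, 15.0)
```

For the validation and test sets no such scaling would be applied, since the recorded levels are kept unchanged.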

4.1 Baseline methods and evaluation metrics

Speech enhancement. One popular time-domain network, i.e. FaSNet-TAC [10], and one recently-proposed frequency-domain network, i.e. SpatialNet [28], are used for benchmarking the speech enhancement performance of the proposed dataset. The negative of scale-invariant signal-to-distortion ratio (SI-SDR) [31] is used as the loss function for training the two baseline networks. For FaSNet-TAC, the best configuration reported in its original paper is used. For SpatialNet, to reduce the computational complexity, a tiny version is used, where the hidden size of the SpatialNet-small version reported in the paper [28] is further reduced from 96 to 48.
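For reference, the negative SI-SDR loss can be written as below. This is a NumPy sketch of the standard SI-SDR definition, not the training code of either baseline; in practice the same expression would be implemented in an automatic-differentiation framework. The zero-mean normalization and the `eps` constant are common conventions and are assumptions here.

```python
import numpy as np

def neg_si_sdr(estimate, target, eps=1e-8):
    """Negative scale-invariant SDR (in dB) between two 1-D signals."""
    # Remove the mean so that a constant offset does not affect the score.
    estimate = estimate - np.mean(estimate)
    target = target - np.mean(target)
    # Project the estimate onto the target to obtain the scaled reference.
    alpha = np.dot(estimate, target) / (np.dot(target, target) + eps)
    projection = alpha * target
    error = estimate - projection
    si_sdr = 10.0 * np.log10((np.dot(projection, projection) + eps)
                             / (np.dot(error, error) + eps))
    return -si_sdr
```

Because of the projection step, rescaling the estimate leaves the loss unchanged, which is the scale-invariance property the metric is named after.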

SI-SDR, WB-PESQ [32], and MOS-SIG, MOS-BAK and MOS-OVR from DNSMOS [33] are used to measure the speech enhancement performance. The ASR performance is evaluated with an established ASR model trained on WenetSpeech [30], a Mandarin dataset of over 10,000 hours, using the ESPNet toolkit.

Sound source localization. Azimuth angle localization is performed. We adopt a convolutional recurrent neural network (CRNN) as one baseline system for sound source localization. The baseline CRNN comprises a 10-layer CNN and a 1-layer gated recurrent unit. The kernel size of all convolutional layers is $3 \times 3$, and each convolutional layer is followed by an instance normalization and a rectified linear unit activation function. Max pooling is applied to compress the frequency and time dimensions after every two convolutional layers. This baseline CRNN is very similar to the CRNN used in many sound source localization methods [34, 35, 36]. The spatial spectrum, with candidate locations at every $1^{\circ}$ of azimuth angle, is used as the learning target [8, 37, 38, 39]. A linear classifier with sigmoid activation is used to predict the spatial spectrum. A recently proposed sound source localization method, i.e. IPDnet [12], is also used as a baseline system. The hidden size of the original IPDnet is reduced from 256 to 128. Candidate locations are also set at every $1^{\circ}$ of azimuth angle.

The localization results are evaluated with i) the Mean Absolute Error (MAE) and ii) the Localization Accuracy (ACC) ($N^{\circ}$), namely the ratio of frames with an azimuth estimation error of less than $N^{\circ}$.
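The two metrics can be sketched as follows. The wrap-around handling of the angular difference (so that 359° and 1° are 2° apart) is an assumption, since the text does not state how errors across the 0°/360° boundary are treated; the function names are hypothetical.

```python
import numpy as np

def angular_error_deg(pred_deg, true_deg):
    """Absolute azimuth error in degrees, wrapped to [0, 180] (assumption)."""
    diff = np.abs(np.asarray(pred_deg, dtype=float)
                  - np.asarray(true_deg, dtype=float)) % 360.0
    return np.minimum(diff, 360.0 - diff)

def localization_metrics(pred_deg, true_deg, tolerance_deg=5.0):
    """Return (MAE in degrees, ACC as a ratio in [0, 1]) over all frames."""
    err = angular_error_deg(pred_deg, true_deg)
    mae = float(np.mean(err))
    # ACC(N): ratio of frames whose error is strictly less than N degrees.
    acc = float(np.mean(err < tolerance_deg))
    return mae, acc
```

Multiplying the returned accuracy by 100 gives a percentage, as in ACC(5°) [%].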

Table 4: Benchmark experiments of speech enhancement.
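Columns 4 to 9 report the static-speaker results and columns 10 to 15 the moving-speaker results.

| Baseline | Training speech | Training noise | Static WB-PESQ | Static SI-SDR | Static MOS-SIG | Static MOS-BAK | Static MOS-OVR | Static CER | Moving WB-PESQ | Moving SI-SDR | Moving MOS-SIG | Moving MOS-BAK | Moving MOS-OVR | Moving CER |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| unprocessed | - | - | 1.15 | -9.8 | 2.00 | 1.72 | 1.51 | 19.9 | 1.11 | -9.1 | 1.79 | 1.54 | 1.36 | 23.8 |
| FaSNet-TAC [10] | sim | sim | 1.38 | -3.4 | 2.67 | 3.19 | 2.22 | 27.1 | 1.33 | -2.5 | 2.60 | 3.12 | 2.14 | 29.7 |
| FaSNet-TAC [10] | sim | real | 1.49 | -1.7 | 2.83 | 3.23 | 2.35 | 22.4 | 1.42 | -1.5 | 2.78 | 3.20 | 2.29 | 25.7 |
| FaSNet-TAC [10] | real | sim | 1.47 | 0.8 | 2.67 | 3.09 | 2.18 | 23.7 | 1.40 | 0.5 | 2.58 | 3.05 | 2.10 | 28.2 |
| FaSNet-TAC [10] | real | real | 1.51 | 1.3 | 2.80 | 3.34 | 2.35 | 22.4 | 1.43 | 1.1 | 2.73 | 3.28 | 2.27 | 26.3 |
| SpatialNet [28] | sim | sim | 1.40 | -8.4 | 3.09 | 2.62 | 2.28 | 19.2 | 1.33 | -7.9 | 3.06 | 2.53 | 2.23 | 23.2 |
| SpatialNet [28] | sim | real | 1.45 | -2.6 | 2.58 | 2.35 | 1.95 | 23.0 | 1.38 | -2.6 | 2.54 | 2.25 | 1.89 | 26.5 |
| SpatialNet [28] | real | sim | 1.96 | 3.8 | 3.09 | 3.06 | 2.45 | 17.3 | 1.80 | 3.0 | 3.00 | 2.99 | 2.36 | 21.2 |
| SpatialNet [28] | real | real | 2.10 | 6.1 | 3.05 | 3.51 | 2.62 | 16.0 | 1.90 | 3.8 | 2.96 | 3.45 | 2.52 | 21.5 |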

4.2 Benchmark experiments

This section benchmarks the speech enhancement and source localization performance of the proposed dataset, using a 9-channel sub-array (the smallest circle plus the center microphone, with the center microphone used as the reference microphone when necessary). In addition, we evaluate the effect of the simulation-to-real mismatch on the speech enhancement and source localization tasks. An equal amount of multichannel speech is simulated according to our real-recorded dataset. Specifically, one counterpart utterance is simulated for each real-recorded utterance using the same source speech, room size, T60, and source position/trajectory as the real-recorded utterance. Note that the source speech, room size, and T60 of each real-recorded utterance are available in our dataset. The source position/trajectory can be obtained with the calibration parameters of the fisheye camera. However, the resulting elevation angle and source distance are not very accurate. We consider this unproblematic for generating the simulated speech data, but these two quantities will not be publicly released. Multichannel noise is simulated with the diffuse noise generator [5], using single-channel white, babble, and factory noise.

The simulated speech/noise can also be combined with real-recorded noise/speech. For each setting, the signal-to-noise ratio (SNR) for mixing speech and noise is uniformly sampled in [0, 15] dB.

Speech Enhancement. The results of speech enhancement are shown in Table 4. Overall, compared to the other settings, training with real speech and real noise achieves the best speech enhancement performance on the real-recorded test set for both baseline networks. As for the intrusive metrics, i.e. WB-PESQ and SI-SDR, the target clean speech provided in this dataset is used as the reference signal, which may lead to some measurement bias for the other settings; therefore these metrics are only presented for reference. The non-intrusive DNSMOS scores can better reflect the speech quality. It can be seen that, for FaSNet-TAC, training with simulated speech and real noise achieves almost the same DNSMOS performance as training with real speech plus real noise, which indicates that, for this network, speech simulation does not suffer from the simulation-to-real problem in speech enhancement. However, this is not the case for SpatialNet, for which training with simulated speech and real noise achieves the worst performance. Some validation experiments have been conducted to figure out the reasons for this phenomenon. The most likely reason is that slight mismatches between the actual and the designed microphone positions lead to severe overfitting when training with simulated speech and real noise, although the exact mechanism remains unclear. These results show that different networks exploit different information for speech enhancement, and also perform differently under certain conditions.

Training with real speech and real noise consistently outperforms training with simulated noise. The simulated noise lacks diversity in both spectral pattern and spatial correlation. The spatial correlation of real noise can deviate considerably from that of a theoretical diffuse noise field, and it is also highly time-varying. Please see Appendix D.2 for some real examples. The mismatch between diffuse noise and real noise leads to performance degradation when training with simulated noise. In addition, because the spatial correlation of real noise is complex and non-stationary, it is difficult to develop new techniques for simulating real noise. Therefore, we suggest using real-recorded noise for training speech enhancement networks.
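To make the comparison concrete, the sketch below contrasts the textbook spatial coherence of a spherically isotropic diffuse field, $\sin(2\pi f d/c)/(2\pi f d/c)$, with a coherence estimate computed from two recorded channels. The function names, the speed of sound of 343 m/s, and the STFT parameters are assumptions of this example and are not taken from the paper.

```python
import numpy as np

def diffuse_coherence(freq_hz, distance_m, speed_of_sound=343.0):
    """Theoretical coherence of a spherically isotropic diffuse field."""
    # np.sinc(x) = sin(pi x) / (pi x), so this equals sin(2 pi f d / c) / (2 pi f d / c).
    return np.sinc(2.0 * freq_hz * distance_m / speed_of_sound)

def measured_coherence(x1, x2, n_fft=512, hop=256):
    """Frame-averaged estimate of the complex coherence between two channels."""
    window = np.hanning(n_fft)
    starts = range(0, len(x1) - n_fft + 1, hop)
    X1 = np.array([np.fft.rfft(window * x1[i:i + n_fft]) for i in starts])
    X2 = np.array([np.fft.rfft(window * x2[i:i + n_fft]) for i in starts])
    cross = np.mean(X1 * np.conj(X2), axis=0)
    auto1 = np.mean(np.abs(X1) ** 2, axis=0)
    auto2 = np.mean(np.abs(X2) ** 2, axis=0)
    return cross / np.sqrt(auto1 * auto2 + 1e-12)

rng = np.random.default_rng(0)
ch1 = rng.standard_normal(16000)   # placeholder for one noise channel
ch2 = rng.standard_normal(16000)   # placeholder for a second, independent channel
gamma = measured_coherence(ch1, ch2)
```

Computing `measured_coherence` on short, successive excerpts of a real recording is one way to observe the time variation described above.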

Overall, the proposed dataset is a difficult one for speech enhancement, due to its large scene diversity, high degree of realism, and complex acoustic conditions. The CERs of unprocessed recordings are quite high, i.e. close to or larger than 20%, even though a very strong ASR model (trained with over 10,000 hours of data) is used.

Sound source localization. The results of sound source localization are presented in Table 5. Both real speech and real noise are beneficial for localization performance. Unlike the speech enhancement task, where using simulated speech may not be problematic, the mismatch between simulated RIRs and real recordings degrades sound source localization performance, which is consistent with the findings in [40]. The mismatch between simulated and real noise has a large influence as well.
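The two localization metrics, ACC(5°) and MAE, can be sketched as below. The exact protocol behind the reported numbers (for example frame-level evaluation, the azimuth range, or whether the 5° bound is inclusive) is not restated here, so the wrap-around handling and the inclusive tolerance in this example are assumptions.

```python
import numpy as np

def azimuth_error_deg(pred_deg, true_deg):
    """Smallest absolute angular difference on the circle, in degrees."""
    diff = np.abs(np.asarray(pred_deg, float) - np.asarray(true_deg, float)) % 360.0
    return np.minimum(diff, 360.0 - diff)

def localization_metrics(pred_deg, true_deg, tolerance_deg=5.0):
    """Return (accuracy in percent within the tolerance, mean absolute error in degrees)."""
    err = azimuth_error_deg(pred_deg, true_deg)
    acc = 100.0 * float(np.mean(err <= tolerance_deg))
    mae = float(np.mean(err))
    return acc, mae

# Toy predictions: errors are 2, 3, 10 and 0 degrees after wrap-around.
acc, mae = localization_metrics([1, 359, 100, 180], [359, 2, 90, 180])
```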

Table 5: CRNN benchmark experiments of sound source localization.
| Training speech | Training noise | Static: ACC(5°) [%] | Static: MAE [°] | Moving: ACC(5°) [%] | Moving: MAE [°] |
|---|---|---|---|---|---|
| sim | sim | 71.9 | 10.2 | 68.8 | 9.6 |
| sim | real | 76.7 | 9.9 | 70.3 | 11.1 |
| real | sim | 82.1 | 8.1 | 75.9 | 8.2 |
| real | real | 88.4 | 4.6 | 83.9 | 4.3 |
Table 6: IPDnet variable-array experiments for sound source localization.
| Network | Static: ACC(5°) [%] | Static: MAE [°] | Moving: ACC(5°) [%] | Moving: MAE [°] |
|---|---|---|---|---|
| Fixed-Array [12] | 86.1 | 3.6 | 88.9 | 2.7 |
| Variable-Array [12] | 86.1 | 3.5 | 80.4 | 3.6 |

4.3 Variable-array networks and array generalization

End-to-end speech enhancement and source localization networks are normally array-dependent: although the networks above, trained with real speech and real noise, perform better on the array they were trained on, they cannot be used with other arrays. In this section, we use the data of all 28 microphones on the horizontal plane (microphones 0 to 27) to train variable-array networks, i.e. FaSNet-TAC [10] for speech enhancement and IPDnet [12] for sound source localization, and test whether the trained networks can serve as universal networks that apply directly to unseen arrays.

Table 7: FaSNet-TAC variable-array experiments for speech enhancement.
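| Network | Static: WB-PESQ | Static: SI-SDR | Static: MOS-SIG | Static: MOS-BAK | Static: MOS-OVR | Static: CER | Moving: WB-PESQ | Moving: SI-SDR | Moving: MOS-SIG | Moving: MOS-BAK | Moving: MOS-OVR | Moving: CER |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| unprocessed | 1.15 | -9.8 | 2.00 | 1.72 | 1.51 | 19.9 | 1.11 | -9.1 | 1.79 | 1.54 | 1.36 | 23.8 |
| Fixed-Array | 1.46 | 0.9 | 2.76 | 3.32 | 2.32 | 25.0 | 1.39 | 0.6 | 2.68 | 3.28 | 2.24 | 30.4 |
| Variable-Array | 1.40 | -0.0 | 2.70 | 3.20 | 2.22 | 27.4 | 1.35 | 0.2 | 2.63 | 3.17 | 2.15 | 33.4 |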

We set one test array, namely a 5-channel uniformly-spaced linear array (microphones 11, 3, 0, 7, and 12 in Fig. 1(b)). The variable-array networks are trained on randomly selected sub-arrays of 2 to 8 channels, excluding all 5-channel uniformly-spaced linear arrays.
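One way to implement such a sampling rule is rejection sampling over microphone subsets, sketched below. Everything here is illustrative: the toy 13-microphone layout is not the RealMAN geometry, and the names `is_uniform_linear` and `sample_subarray`, as well as the collinearity tolerance, are hypothetical.

```python
import numpy as np

def is_uniform_linear(points, tol=1e-6):
    """True if the 2-D points are collinear and equally spaced."""
    pts = np.asarray(points, float)
    if len(pts) < 3:
        return True
    centered = pts - pts.mean(axis=0)
    # Principal direction from the SVD; the second direction measures off-axis spread.
    _, _, vt = np.linalg.svd(centered)
    along = centered @ vt[0]
    across = centered @ vt[1]
    if np.max(np.abs(across)) > tol:
        return False
    gaps = np.diff(np.sort(along))
    return bool(np.all(np.abs(gaps - gaps[0]) <= tol))

def sample_subarray(positions, rng, min_ch=2, max_ch=8, held_out_ch=5):
    """Draw a random sub-array, rejecting held-out uniform linear geometries."""
    while True:
        n = int(rng.integers(min_ch, max_ch + 1))
        idx = np.sort(rng.choice(len(positions), size=n, replace=False))
        if n == held_out_ch and is_uniform_linear(positions[idx]):
            continue   # this geometry is reserved for testing
        return idx

rng = np.random.default_rng(0)
# Toy layout (hypothetical): a 5-microphone line plus an 8-microphone ring.
line = np.array([[x, 0.0] for x in (-0.06, -0.03, 0.0, 0.03, 0.06)])
angles = np.linspace(0.0, 2.0 * np.pi, 8, endpoint=False) + 0.2
ring = 0.05 * np.stack([np.cos(angles), np.sin(angles)], axis=1)
positions = np.vstack([line, ring])
channels = sample_subarray(positions, rng)
```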

Table 7 and Table 6 present the results of speech enhancement and sound source localization, respectively. There are indeed certain performance losses compared with the fixed-array networks trained on the test array, but the losses are relatively small. This shows that the 32-channel real-recorded microphone array data provided in the proposed dataset can successfully train variable-array networks, which offers a competitive solution to the simulation-to-real problem of multichannel speech enhancement and sound source localization.

5 Conclusion

This paper presents a new real-recorded and annotated microphone array speech and noise dataset, called RealMAN, for speech enhancement and localization. Baseline experiments demonstrate that training with our real-recorded data outperforms training with simulated data, effectively eliminating the simulation-to-real gap for speech enhancement and localization. The performance on our dataset can better reflect the capabilities of tested algorithms in real-world applications, providing a more reliable benchmark for speech enhancement and localization. Additionally, unified variable-array networks can be successfully trained using various sub-arrays of the proposed 32-channel microphone array. These trained variable-array networks have the potential to be applied directly to unseen arrays in real-world applications.

References

  • [1] Sharon Gannot, Emmanuel Vincent, Shmulik Markovich-Golan, and Alexey Ozerov. A consolidated perspective on multimicrophone speech enhancement and source separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP), 25(4):692–730, 2017.
  • [2] DeLiang Wang and Jitong Chen. Supervised speech separation based on deep learning: An overview. IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP), 26(10):1702–1726, 2018.
  • [3] Pierre-Amaury Grumiaux, Srđan Kitić, Laurent Girin, and Alexandre Guérin. A survey of sound source localization with deep learning methods. Journal of the Acoustical Society of America (JASA), 152(1):107–151, 2022.
  • [4] Jont B. Allen and David A. Berkley. Image method for efficiently simulating small-room acoustics. The Journal of the Acoustical Society of America (JASA), 65(4):943–950, 1979.
  • [5] Emanuel A. P. Habets, Israel Cohen, and Sharon Gannot. Generating nonstationary multisensor signals under a spatial coherence constraint. The Journal of the Acoustical Society of America (JASA), 124(5):2911–2917, 2008.
  • [6] Prerak Srivastava, Antoine Deleforge, and Emmanuel Vincent. Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. In International Workshop on Acoustic Signal Enhancement (IWAENC), 2022.
  • [7] Eric A. Lehmann and Anders M. Johansson. Diffuse reverberation model for efficient image-source simulation of room impulse responses. IEEE Transactions on Audio, Speech, and Language Processing (TASLP), 18(6):1429–1439, 2010.
  • [8] Weipeng He, Petr Motlicek, and Jean-Marc Odobez. Neural network adaptation and data augmentation for multi-speaker direction-of-arrival estimation. IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP), 29:1303–1317, 2021.
  • [9] Shunsuke Kita and Yoshinobu Kajikawa. Sound source localization inside a structure under semi-supervised conditions. IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP), 31:1397–1408, 2023.
  • [10] Yi Luo, Zhuo Chen, Nima Mesgarani, and Takuya Yoshioka. End-to-end Microphone Permutation and Number Invariant Multi-channel Speech Separation. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6394–6398, May 2020.
  • [11] Siyuan Zhang and Xiaofei Li. Microphone array generalization for multichannel narrowband deep speech enhancement. In INTERSPEECH, pages 666–670, 2021.
  • [12] Yabo Wang, Bing Yang, and Xiaofei Li. IPDnet: A universal direct-path IPD estimation network for sound source localization. arXiv preprint arXiv:2405.07021, 2024.
  • [13] Elior Hadad, Florian Heese, Peter Vary, and Sharon Gannot. Multichannel audio database in various acoustic environments. In International Workshop on Acoustic Signal Enhancement, pages 313–317, 2014.
  • [14] Igor Szoke, Miroslav Skacel, Ladislav Mosner, Jakub Paliesek, and Jan (Honza) Cernocky. Building and evaluation of a real room impulse response dataset. IEEE Journal of Selected Topics in Signal Processing (JSTSP), 13(4):863–876, 2019.
  • [15] Keisuke Kinoshita, Marc Delcroix, Takuya Yoshioka, and Tomohiro Nakatani. The reverb challenge: A common evaluation framework for dereverberation and recognition of reverberant speech. In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pages 1–4, 2013.
  • [16] Archontis Politis, Sharath Adavanne, and Tuomas Virtanen. A dataset of reverberant spatial sound scenes with moving sources for sound event localization and detection. In Detection and Classification of Acoustic Scenes and Events Workshop (DCASE), pages 165–169, 2020.
  • [17] James Eaton, Nikolay D. Gaubitch, Alastair H. Moore, and Patrick A. Naylor. Estimation of room acoustic parameters: The ACE challenge. IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP), 24(10):1681–1693, 2016.
  • [18] Diego Di Carlo, Pinchas Tandeitnik, Cédric Foy, Nancy Bertin, Antoine Deleforge, and Sharon Gannot. dEchorate: A calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, 2021(39):1–15, 2021.
  • [19] Jon Barker, Ricard Marxer, Emmanuel Vincent, and Shinji Watanabe. The third ‘CHiME’ speech separation and recognition challenge: Dataset, task and baselines. In IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pages 504–511, 2015.
  • [20] Zhuo Chen, Takuya Yoshioka, Liang Lu, Tianyan Zhou, Zhong Meng, Yi Luo, Jian Wu, Xiong Xiao, and Jinyu Li. Continuous speech separation: Dataset and analysis. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7284–7288, 2020.
  • [21] Mike Lincoln, Iain McCowan, Jithendra Vepa, and Hari Krishna Maganti. The multi-channel Wall Street Journal audio visual corpus (MC-WSJ-AV): Specification and initial experiments. In IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pages 357–362, 2005.
  • [22] Jon Barker, Shinji Watanabe, Emmanuel Vincent, and Jan Trmal. The fifth ’CHiME’ speech separation and recognition challenge: Dataset, task and baselines. In INTERSPEECH, pages 1561–1565, 2018.
  • [23] I. McCowan, J. Carletta, W. Kraaij, S. Ashby, S. Bourban, M. Flynn, M. Guillemot, T. Hain, J. Kadlec, V. Karaiskos, M. Kronenthal, G. Lathoud, M. Lincoln, A. Lisowska, W. Post, D. Reidsma, and P. Wellner. The AMI meeting corpus. In International Conference on Methods and Techniques in Behavioral Research, 2005.
  • [24] Yihui Fu, Luyao Cheng, Shubo Lv, Yukai Jv, Yuxiang Kong, Zhuo Chen, Yanxin Hu, Lei Xie, Jian Wu, Hui Bu, Xin Xu, Jun Du, and Jingdong Chen. AISHELL-4: An open source dataset for speech enhancement, separation, recognition and speaker diarization in conference scenario. In INTERSPEECH, pages 3665–3669, 2021.
  • [25] Fan Yu, Shiliang Zhang, Yihui Fu, Lei Xie, Siqi Zheng, Zhihao Du, Weilong Huang, Pengcheng Guo, Zhijie Yan, Bin Ma, Xin Xu, and Hui Bu. M2MeT: The ICASSP 2022 multi-channel multi-party meeting transcription challenge. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6167–6171, 2022.
  • [26] Heinrich W. Lollmann, Christine Evers, Alexander Schmidt, Heinrich Mellmann, Hendrik Barfuss, Patrick A. Naylor, and Walter Kellermann. The LOCATA challenge data corpus for acoustic source localization and tracking. In IEEE Sensor Array and Multichannel Signal Processing Workshop, pages 410–414, 2018.
  • [27] Joachim Thiemann, Nobutaka Ito, and Emmanuel Vincent. The diverse environments multi-channel acoustic noise database (DEMAND): A database of multichannel environmental noise recordings. In Meetings on Acoustics, 2013.
  • [28] Changsheng Quan and Xiaofei Li. SpatialNet: Extensively learning spatial information for multichannel joint speech separation, denoising and dereverberation. IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP), 32:1310–1323, 2024.
  • [29] Zhong-Qiu Wang, Samuele Cornell, Shukjae Choi, Younglo Lee, Byeong-Yeol Kim, and Shinji Watanabe. TF-GridNet: Integrating full- and sub-band modeling for speech separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP), 31:3221–3236, 2023.
  • [30] Binbin Zhang, Hang Lv, Pengcheng Guo, Qijie Shao, Chao Yang, Lei Xie, Xin Xu, Hui Bu, Xiaoyu Chen, Chenchen Zeng, et al. WenetSpeech: A 10000+ hours multi-domain Mandarin corpus for speech recognition. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6182–6186, 2022.
  • [31] Jonathan Le Roux, Scott Wisdom, Hakan Erdogan, and John R. Hershey. SDR – Half-baked or Well Done? In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 626–630, May 2019.
  • [32] A.W. Rix, J.G. Beerends, M.P. Hollier, and A.P. Hekstra. Perceptual evaluation of speech quality (PESQ)-a new method for speech quality assessment of telephone networks and codecs. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), volume 2, pages 749–752, 2001.
  • [33] Chandan K A Reddy, Vishak Gopal, and Ross Cutler. DNSMOS P.835: A non-intrusive perceptual objective speech quality metric to evaluate noise suppressors. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 886–890, 2022.
  • [34] Bing Yang, Hong Liu, and Xiaofei Li. Learning deep direct-path relative transfer function for binaural sound source localization. IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP), 29:3491–3503, 2021.
  • [35] Bing Yang, Hong Liu, and Xiaofei Li. SRP-DNN: Learning direct-path phase difference for multiple moving sound source localization. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 721–725, 2022.
  • [36] Lei Wang, Zhibin Jiao, Qiyong Zhao, Jie Zhu, and Yang Fu. Framewise multiple sound source localization and counting using binaural spatial audio signals. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5, 2023.
  • [37] Weipeng He, Petr Motlícek, and Jean-Marc Odobez. Joint localization and classification of multiple sound sources using a multi-task neural network. In INTERSPEECH, pages 312–316, 2018.
  • [38] Thi Ngoc Tho Nguyen, W. S. Gan, Rishabh Ranjan, and Douglas L. Jones. Robust source counting and DOA estimation using spatial pseudo-spectrum and convolutional neural network. IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP), 28:2626–2637, 2020.
  • [39] Pierre-Amaury Grumiaux, Srđan Kitić, Laurent Girin, and Alexandre Guérin. Improved feature extraction for CRNN-based multiple sound source localization. In 29th European Signal Processing Conference (EUSIPCO), pages 231–235, 2021.
  • [40] Bing Yang and Xiaofei Li. Self-supervised learning of spatial acoustic representation with cross-channel signal reconstruction and multi-channel conformer. arXiv preprint arXiv:2312.00476, 2023.
  • [41] Martin Holters, Tobias Corbach, and Udo Zölzer. Impulse response measurement techniques and their applicability in the real world. In Proceedings of the 12th International Conference on Digital Audio Effects (DAFx-09), pages 108–112, 2009.
  • [42] Charles H. Knapp and G. Clifford Carter. The generalized correlation method for estimation of time delay. IEEE Transactions on Acoustics, Speech, and Signal Processing, 24(4):320–327, 1976.

Checklist

  1. For all authors…

    (a) Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope? [Yes]

    (b) Did you describe the limitations of your work? [Yes] See Sec. 2. One unreal factor is that we use a loudspeaker playing back speech signals, instead of speaking by real human speakers.

    (c) Did you discuss any potential negative societal impacts of your work? [Yes] There are no potential negative societal impacts. See supplementary materials.

    (d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]

  2. If you are including theoretical results…

    (a) Did you state the full set of assumptions of all theoretical results? [N/A]

    (b) Did you include complete proofs of all theoretical results? [N/A]

  3. If you ran experiments (e.g. for benchmarks)…

    (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] URL.

    (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] See Sec. 3.5 and 4

    (c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [No]

    (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] See Appendix. D.1

  4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets…

    (a) If your work uses existing assets, did you cite the creators? [Yes]

    (b) Did you mention the license of the assets? [Yes] See the github URL

    (c) Did you include any new assets either in the supplemental material or as a URL? [Yes] See the github URL

    (d) Did you discuss whether and how consent was obtained from people whose data you’re using/curating? [Yes]

    (e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [Yes] See supplementary materials

  5. If you used crowdsourcing or conducted research with human subjects…

    (a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A]

    (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]

    (c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]

Appendix A Existing Microphone Array Speech Datasets

As a supplement to Table 1, additional existing datasets of microphone array speech recordings are listed in Table 8, including LibriCSS [20], MC-WSJ-AV [21], CHiME-5/-6/-7 [22], AMIMeeting [23], AISHELL-4 [24] and AliMeeting [25]. Although these datasets provide multichannel speech recordings with speech transcriptions, they do not provide any direct annotations for speech enhancement and source localization.

Table 8: Additional microphone array speech datasets.
| Dataset | # Scenes | Scene type | Source state | Speech duration | Main data | Microphone array (×1 by default) |
|---|---|---|---|---|---|---|
| LibriCSS [20] | 1 | Meeting | Static | 10 h | Recording, Transcription | 7-ch circular |
| MC-WSJ-AV [21] | 3 | Meeting | Static+moving | - | Recording, Transcription | 8-ch circular (×2) |
| CHiME-5/-6/-7 [22] | 20 | Home | Moving | 50 h | Recording, Transcription | Kinect (×6), binaural pairs (×4) |
| AMIMeeting [23] | 3 | Meeting | - | 100 h | Recording, Transcription | 8-ch circular, 8/4/10-ch |
| AISHELL-4 [24] | 10 | Meeting | Static + slightly moving | 120 h | Recording, Transcription | 8-ch circular |
| AliMeeting [25] | 21 | Meeting | Static | 120 h | Recording, Transcription | 8-ch circular |

Appendix B Dataset Details

B.1 Scene information

We present detailed information on the acoustic scenes of the RealMAN recordings in Table 9. The scene names in this table are consistent with the names in our released dataset. The reverberation time, i.e. T60, is provided for indoor and semi-outdoor scenes. The T60s of most enclosed scenes are measured using our recording system with an exponential sine sweep signal (see details in Appendix B.2). For some scenes into which we could not move our devices, the T60s are measured with a mobile phone application, which may be inaccurate.

Table 9: Summary of recording scene information. The T60s of indoor scenes are measured by ourselves (see Appendix B.2 for measurement details), except that the underlined ones are measured by a less precise phone application.
Scene name Dataset Speech Noise T60 (s) Category Room Size (m)
BarberShop Train × 0.653 Indoor -
Bus-Electric Train 0.283 Transportation -
Cafeteria2 Train × 1.242 Indoor 13.3×9.2×3.7
ConstructionPlant Train × - Outdoor -
LivingRoom3 Train × 0.376 Indoor 8.5×5.3×2.6
LivingRoom7 Train × 0.398 Indoor 12.9×8.2×2.2
LivingRoom9 Train × 0.562 Indoor 9.7×9.6×3.0
Park Train × - Outdoor -
PedestrianStreet Train × 0.336 Semi-outdoor -
Roadside1 Train × - Outdoor -
Roadside2 Train × - Outdoor -
ShoppingMall Train × 0.946 Indoor -
Subway-Electric Train × - Transportation -
Terrace1 Train × 0.355 Semi-outdoor -
Terrace2 Train × 0.605 Semi-outdoor -
UndergroundParking1 Train × 2.923 Indoor -
Auditorium Train,Validation × 1.179 Indoor -
BadmintonCourt1 Train,Validation 1.577 Indoor 28.6×7.7×8.6
Gym Train,Validation 1.504 Indoor 39.5×8.5×4.4
LivingRoom6 Train,Validation × 0.398 Indoor 3.6×2.0×2.6
LivingRoom8 Train,Validation × 0.399 Indoor 3.8×2.0×2.6
Market Train,Validation 1.228 Indoor 31.0×30.0×4.9
Cafeteria1 Train,Test 0.763 Indoor -
Car-Gasoline Train,Test 0.069 Transportation -
Classroom2 Train,Test × 0.679 Indoor 20.0×16.4×3.5
Classroom3 Train,Test × 1.358 Indoor 23.8×6.9×3.9
Library Train,Test × 1.180 Indoor radius=8.8
LivingRoom1 Train,Test 0.598 Indoor 9.7×9.6×3.0
LivingRoom2 Train,Test × 0.844 Indoor 4.5×2.8×2.9
LivingRoom4 Train,Test × 0.427 Indoor 7.7×3.0×2.0
LivingRoom5 Train,Test × 0.562 Indoor 4.2×3.6×2.9
OfficeRoom4 Train,Test × 0.888 Indoor 6.8×5.1×3.3
BasketballCourt1 Train,Validation,Test - Outdoor -
Car-Electric Train,Validation,Test 0.086 Transportation -
Classroom1 Train,Validation,Test* 0.816 Indoor 12.7×7.6×3.5
Laundry Train,Validation,Test × 0.534 Indoor 3.6×2.6×2.5
OfficeRoom3 Train,Validation,Test 1.295 Indoor 8.5×6.0×3.5
OfficeLobby Train,Validation,Test 2.476 Indoor -
SunkenPlaza1 Train,Validation,Test × 0.915 Semi-outdoor 27.6×14.6
SubkenPlaza2 Train,Validation,Test × 0.915 Semi-outdoor -
BasketballCourt2 Validation - Outdoor -
Cafeteria3 Validation 0.843 Indoor -
OfficeRoom1 Validation × 0.719 Indoor 18.6×10.5×5.0
BadmintonCourt2 Test 1.694 Indoor 28.6×7.7×8.6
OfficeRoom2 Test 0.490 Indoor -
UndergroundParking2 Test 4.928 Indoor -

* The speech data of Classroom1 is assigned to the training set only, whereas the corresponding noise data is split into the training, validation and test sets.

B.2 T60 measurement

The T60 is measured based on the RIR measurement with an exponential sine sweep signal. Firstly, we generate repeated exponential sine sweeps with a frequency ranging from 200 Hz to 8 kHz and their inverse signals to measure the RIR [41]. Secondly, the energy decay curve of the measured RIR is calculated. The appropriate segment in the energy decay curve is selected to calculate the T20 using linear least-squares regression. The T60 is approximated as three times the T20. For one scene, we conduct multiple trials to measure the T60, and the mean value is taken as the final T60 measurement.
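The second half of this procedure (energy decay curve, T20 regression, extrapolation to T60) can be sketched as follows. This is an illustration under stated assumptions: the function name `t60_from_rir` is hypothetical, the regression range of -5 dB to -25 dB is the conventional T20 range and stands in for the "appropriate segment" mentioned above, and the synthetic RIR replaces a measured one.

```python
import numpy as np

def t60_from_rir(rir, fs, start_db=-5.0, end_db=-25.0):
    """Estimate T60 as three times the T20 obtained from the energy decay curve."""
    energy = np.asarray(rir, float) ** 2
    # Schroeder backward integration yields the energy decay curve (EDC).
    edc = np.cumsum(energy[::-1])[::-1]
    edc_db = 10.0 * np.log10(edc / edc[0] + 1e-20)
    t = np.arange(len(energy)) / fs
    # Linear least-squares fit over the selected 20 dB range of the decay.
    mask = (edc_db <= start_db) & (edc_db >= end_db)
    slope, _ = np.polyfit(t[mask], edc_db[mask], 1)   # slope in dB per second
    t20 = -20.0 / slope
    return 3.0 * t20

# Synthetic RIR: white noise with an exponential envelope that decays 60 dB in 0.5 s.
rng = np.random.default_rng(0)
fs = 16000
t = np.arange(fs) / fs
true_t60 = 0.5
rir = rng.standard_normal(fs) * 10.0 ** (-3.0 * t / true_t60)
estimated_t60 = t60_from_rir(rir, fs)
```

Fitting only the upper part of the decay and extrapolating avoids the noise floor that usually contaminates the last 30 to 40 dB of a measured RIR.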

Appendix C Annotation Details

C.1 Details for target direct-path clean speech estimation

Input: Source clean speech utterance $s_{dp}(t)$, AlignSpatialNet output utterance $\hat{s}_{dp}(t)$.
1 Use the generalized cross-correlation (GCC) [42] between the AlignSpatialNet output (8 kHz) and the source clean speech (8 kHz) to estimate the time shift of the direct-path speech (8 kHz, 48 kHz);
2 Estimate the attenuation of the target clean speech from the power of the AlignSpatialNet output (8 kHz) and of the time-aligned target clean speech (8 kHz);
3 Generate the target direct-path clean speech (48 kHz) with the estimated time shift and level attenuation;
4 Manually check the target clean speech (48 kHz) against the corresponding recordings (48 kHz), and discard the speech utterance if it is unreliable.
Output: Target direct-path clean speech utterance (48 kHz).
Algorithm 1 Target clean speech estimation for static speaker
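Steps 1 to 3 of Algorithm 1 can be sketched at a single sampling rate as below. This is not the released annotation code: the PHAT weighting is one common GCC variant chosen for the example (the text only specifies GCC [42]), the helper names are hypothetical, `max_shift` must be at least 1, and the "enhanced" signal is a synthetic stand-in for the AlignSpatialNet output.

```python
import numpy as np

def gcc_phat_shift(reference, delayed, max_shift):
    """Lag in samples (positive if `delayed` lags `reference`) at the GCC peak."""
    n = len(reference) + len(delayed)
    cross = np.fft.rfft(delayed, n) * np.conj(np.fft.rfft(reference, n))
    cross /= np.abs(cross) + 1e-12                     # PHAT weighting (assumption)
    cc = np.fft.irfft(cross, n)
    # Reorder so that index i corresponds to lag i - max_shift.
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return int(np.argmax(cc)) - max_shift

def shift_signal(x, tau):
    """Delay (tau > 0) or advance (tau < 0) a signal with zero padding."""
    y = np.zeros_like(x)
    if tau >= 0:
        y[tau:] = x[:len(x) - tau]
    else:
        y[:tau] = x[-tau:]
    return y

def estimate_target(source, enhanced, max_shift=400):
    """Time-align the source to the enhanced signal and match its power."""
    tau = gcc_phat_shift(source, enhanced, max_shift)
    aligned = shift_signal(source, tau)
    attenuation = np.sqrt(np.mean(enhanced ** 2) / (np.mean(aligned ** 2) + 1e-12))
    return tau, attenuation, attenuation * aligned

rng = np.random.default_rng(0)
source = rng.standard_normal(8000)                     # placeholder source speech
enhanced = 0.3 * shift_signal(source, 37) + 0.01 * rng.standard_normal(8000)
tau, attenuation, target = estimate_target(source, enhanced)
```

Because the target is a delayed and scaled copy of the clean source, it inherits none of the artifacts of the enhancement network; the network output only serves to find the delay and the gain.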

It is difficult to accurately estimate $A$ and $\tau$ from noisy and reverberant speech recordings, due to the signal distortion caused by reverberation and noise. Therefore, we apply a prior speech enhancement step (using the AlignSpatialNet presented in Appendix C.2) that gives a rough estimate of the direct-path speech, denoted as $\hat{s}_{dp}(t)$, which is supposed to be roughly time- and level-aligned with $s_{dp}(t)$. Note that $\hat{s}_{dp}(t)$ is not used directly as the target clean speech, because its interpretability and accuracy are not good enough. Instead, we calculate the generalized cross-correlation (GCC) [42] between $h_{dev} * s(t)$ and $\hat{s}_{dp}(t)$ and estimate the time shift $\tau$ (at the sample level) as the location of the maximum GCC peak. For static sources, $A$ and $\tau$ are time-invariant, and we estimate them at the utterance level. For moving sources, $A$ and $\tau$ are time-varying. Because GCC is a statistic computed over a span of signal, estimating $\tau$ at the sample level is intractable. Therefore, we cut each speech utterance into overlapping segments and estimate $\tau$ at the segment level.
The segment length of the direct-path speech is determined by the time delay estimates of consecutive segments, and the segments of source speech are then stretched or compressed (by a resampling operation) to align with the segments of direct-path speech. After that, the segment-level $A$ can also be estimated. Note that data cleansing and nonlinear interpolation are applied in both the $\tau$ and $A$ estimations. As a result, the final $\tau$ and $A$ estimates for moving sources are at the segment and sample levels, respectively. The details of target clean speech estimation for static and moving sources are shown in Algorithms 1 and 2, respectively. For simplicity, the sampling rate of each speech signal is given in parentheses and the downsampling operation is omitted.

Input: Source clean speech utterance $s_{dp}(t)$, AlignSpatialNet output utterance $\hat{s}_{dp}(t)$.
1 Divide the speech utterance into 1-second segments with an overlap of 50%;
2 for segment in the utterance do
3       Use the GCC between the segments of the AlignSpatialNet output (8 kHz) and of the source speech (8 kHz) to estimate the time shift of the direct-path speech;
4 end for
5 Cleanse the time shift estimates by removing unreasonable values;
6 Use the Pchip interpolator to replace the unreasonable time shift estimates and generate a smooth time shift sequence for all segments;
7 for segment in the utterance do
8       Based on the time shift estimates, stretch or compress the source segment by a resampling operation (to mimic the Doppler frequency shift caused by speaker movement) and generate the time-aligned direct-path speech segment (8 kHz, 48 kHz);
9       Estimate the attenuation of the direct-path speech segment from the power of the AlignSpatialNet output (8 kHz) and of the time-aligned direct-path speech segment (8 kHz);
10 end for
11 Cleanse the attenuation estimates by removing unreasonable values;
12 Use the Pchip interpolator to replace the unreasonable attenuation estimates and generate a smooth attenuation sequence for all sampling points;
13 for sampling point in time-aligned direct-path speech (48 kHz) do
14       Apply the attenuation estimate to the sampling point;
15 end for
16 Concatenate the direct-path speech sampling points to generate the direct-path speech utterance (48 kHz);
17 Manually check the direct-path speech (48 kHz) against the corresponding recordings (48 kHz), and discard the speech utterance if it is unreliable.
Output: Target direct-path clean speech utterance (48 kHz)
Algorithm 2 Target clean speech estimation for moving speaker
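The cleansing and Pchip interpolation steps of Algorithm 2 can be illustrated with SciPy's `PchipInterpolator`. The outlier rule below (deviation from a 5-point running median) is a hypothetical stand-in, since the text does not specify how unreasonable values are detected; the function name and the toy numbers are likewise assumptions.

```python
import numpy as np
from scipy.interpolate import PchipInterpolator

def cleanse_and_interpolate(seg_times, seg_values, query_times, max_dev):
    """Drop outlying segment-level estimates and interpolate a smooth sequence."""
    seg_times = np.asarray(seg_times, float)
    seg_values = np.asarray(seg_values, float)
    # Hypothetical outlier rule: compare each value with a 5-point running median.
    padded = np.pad(seg_values, 2, mode="edge")
    running_median = np.array([np.median(padded[i:i + 5])
                               for i in range(len(seg_values))])
    keep = np.abs(seg_values - running_median) <= max_dev
    # Shape-preserving cubic interpolation through the retained estimates.
    interpolator = PchipInterpolator(seg_times[keep], seg_values[keep],
                                     extrapolate=True)
    return interpolator(np.asarray(query_times, float)), keep

# 1-second segments with 50% overlap give one estimate every 0.5 s.
seg_times = 0.5 * np.arange(10)
time_shifts = 100.0 + 2.0 * np.arange(10)   # slowly increasing delay in samples
time_shifts[4] = 500.0                      # one spurious GCC peak
smooth, keep = cleanse_and_interpolate(seg_times, time_shifts, seg_times, max_dev=10.0)
```

The same routine can be queried at every sampling instant to obtain the sample-level attenuation sequence used in the later steps of the algorithm.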

Since there is no available ground truth, we illustrate an example of GCC in Fig. 2(a), and an example of segment-level time shift and sample-level attenuation estimates in Fig. 2(b), for judging the credibility of the estimation of $A$ and $\tau$. The single sharp peak of GCC indicates the reliability of the time shift estimation. Note that if the GCC of one utterance for the static speaker (or one segment for the moving speaker) does not present such a sharp peak, the utterance for the static speaker will be abandoned (or the segment-level time shift estimate for the moving speaker will be abandoned and recovered by interpolation). The smoothness and consistency of the $A$ and $\tau$ estimates indicate their reliability for moving speakers.

Figure 2: Examples of intermediate results for target direct-path clean speech estimation. (a) GCC output for a static speaker. (b) Time shift and gain estimates for a moving speaker.

C.2 Network architecture for source-signal guided direct-path speech estimation

Figure 3: The architecture of the proposed AlignSpatialNet.

We propose a network named AlignSpatialNet for source-signal guided direct-path speech estimation. As shown in Fig. 3, the network has two branches, one for the recorded microphone signals and one for the played source clean speech signal. Each branch is extended from our previously proposed SpatialNet [28]. Besides the neural components of SpatialNet, i.e. the cross-band block, the multi-head self-attention module (MHSA) and the time-convolutional feedforward network (T-ConvFFN), each branch also contains a multi-head cross-attention module (MHCA) for extracting the spatial cues, i.e. time shift and level attenuation, between the played source signal and the recorded microphone signals.

The MHCA module takes the hidden representations of microphone signals denoted as hmF×T×Hsubscripth𝑚superscript𝐹𝑇𝐻\textbf{h}_{m}\in\mathbb{R}^{F\times T\times H}h start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_F × italic_T × italic_H end_POSTSUPERSCRIPT and source signal denoted as hsF×T×Hsubscripth𝑠superscript𝐹𝑇𝐻\textbf{h}_{s}\in\mathbb{R}^{F\times T\times H}h start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_F × italic_T × italic_H end_POSTSUPERSCRIPT as input, where F𝐹Fitalic_F, T𝑇Titalic_T and H𝐻Hitalic_H denote the number of frequencies, time frames, and hidden units. The cross-attention in MHCA can be formulated as:

$$\mathbf{h}_q[f,t,:] \leftarrow \text{Attention}\big(\mathbf{h}_q[f,t,:],\ \mathbf{h}_k[f,t-l:t-l,:],\ \mathbf{h}_v[f,t-l:t-l,:]\big)$$

where $\mathbf{h}_q$, $\mathbf{h}_k$ and $\mathbf{h}_v$ are respectively the query, key and value vectors, which correspond to $\mathbf{h}_s$, $\mathbf{h}_m$ and $\mathbf{h}_m$ in the source branch, and to $\mathbf{h}_m$, $\mathbf{h}_s$ and $\mathbf{h}_s$ in the recording branch. $f$ and $t$ denote the frequency index and frame index, and $l$ is a preset value for the maximum number of frame shifts in one layer, which is set to a value corresponding to 200 ms in our experiments.
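
To make the locality of this cross-attention concrete, the following is a minimal single-head NumPy sketch without learned projections. It assumes that each query frame attends to key and value frames at most `l` frames away on either side (clipped at the sequence boundaries); the exact window bounds, the scaling by $\sqrt{H}$ and the function name `local_cross_attention` are assumptions of this sketch rather than details confirmed by the paper.

```python
import numpy as np

def local_cross_attention(h_q, h_k, h_v, l):
    """Time-local cross-attention, applied independently per frequency.

    h_q, h_k, h_v: float arrays of shape (F, T, H).
    l: maximum frame shift (window half-width, assumed symmetric here).
    Returns the updated query representation, shape (F, T, H).
    """
    F, T, H = h_q.shape
    out = np.zeros_like(h_q)
    for t in range(T):
        lo, hi = max(0, t - l), min(T, t + l + 1)    # assumed window around frame t
        k = h_k[:, lo:hi, :]                         # (F, W, H)
        v = h_v[:, lo:hi, :]                         # (F, W, H)
        scores = np.einsum('fh,fwh->fw', h_q[:, t, :], k) / np.sqrt(H)
        scores -= scores.max(axis=1, keepdims=True)  # numerically stable softmax
        w = np.exp(scores)
        w /= w.sum(axis=1, keepdims=True)
        out[:, t, :] = np.einsum('fw,fwh->fh', w, v)
    return out
```

In the source branch one would call this with the source representation as the query and the microphone representation as key and value, and the reverse in the recording branch. With 16 ms frame hops (an assumption borrowed from the localization setup), 200 ms would correspond to roughly 12 frames.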

The proposed AlignSpatialNet is trained with simulated data. Note that, for moving sources, when the moving speed is high, the simulated direct-path signals may have some clicking noise and other audible artifacts [7]. To mitigate this problem, the speech signals at two adjacent locations are half-overlapped and weighted with a trapezium window when generating both the reverberant signal and the direct-path signal in the moving case. The simulated reverberant microphone signals are mixed with our real-recorded multichannel noise signals at a signal-to-noise ratio (SNR) sampled in $[5, 20]$ dB. In training, the negative SNR is used as the loss function.
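
The noise mixing and the loss can be sketched as follows. This is an illustration under stated assumptions: the SNR is drawn uniformly from the range, signal power is averaged over all channels and samples, and the helper names `mix_at_snr` and `neg_snr_loss` are hypothetical.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that speech-to-noise power ratio equals `snr_db`, then add."""
    p_s = np.mean(speech ** 2)
    p_n = np.mean(noise ** 2) + 1e-12
    gain = np.sqrt(p_s / (p_n * 10 ** (snr_db / 10)))
    return speech + gain * noise

def neg_snr_loss(estimate, target, eps=1e-8):
    """Negative SNR (in dB) of `estimate` with respect to `target`."""
    signal = np.sum(target ** 2)
    error = np.sum((target - estimate) ** 2)
    return -10 * np.log10((signal + eps) / (error + eps))

# Example: draw one training SNR (uniform sampling is an assumption).
rng = np.random.default_rng(0)
snr_db = rng.uniform(5, 20)
```

Minimizing `neg_snr_loss` is equivalent to maximizing the SNR of the network output against the direct-path target.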

C.3 Details for LED light detection with fisheye camera

Algorithm 3 provides the pseudocode for the LED light detection algorithm, and Fig. 4 shows four examples of the images captured by the fisheye camera together with the LED light detection results (red boxes).

Input: Fisheye camera video
for each frame in video do
      Convert the video frame to the HSV domain;
      Create a mask for red/green color regions;
      Apply the mask to the HSV frame;
      Initialize a score map;
      for each pixel in the masked region do
            Within the score map, calculate the pixel score based on color intensity and brightness;
      end for
      Find the pixel of maximum score in the score map, as the detected position of the LED light (sound source);
end for
Output: Position of the sound source
Algorithm 3 Estimating Sound Source Position from Fisheye Camera Video
Figure 4: Examples of the vision-based LED light detection in different recording scenes.
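
The per-frame steps of Algorithm 3 can be sketched in NumPy as follows. The hue ranges, the saturation and brightness thresholds, and the score (saturation times value) are all assumptions chosen for illustration; the paper does not specify them, and the function names are hypothetical.

```python
import numpy as np

def rgb_to_hsv(rgb):
    """rgb: float array (..., 3) in [0, 1]. Returns H in [0, 360), S and V in [0, 1]."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    v = rgb.max(axis=-1)
    c = v - rgb.min(axis=-1)                      # chroma
    safe_c = np.where(c > 0, c, 1.0)              # avoid division by zero
    h = np.where(v == r, ((g - b) / safe_c) % 6,
        np.where(v == g, (b - r) / safe_c + 2, (r - g) / safe_c + 4))
    h = np.where(c > 0, 60.0 * h, 0.0)
    s = np.where(v > 0, c / np.where(v > 0, v, 1.0), 0.0)
    return h, s, v

def detect_led(frame_rgb, sat_min=0.5, val_min=0.5):
    """Return (row, col) of the most LED-like pixel in a uint8 RGB frame, or None."""
    h, s, v = rgb_to_hsv(frame_rgb.astype(np.float64) / 255.0)
    red = (h < 20) | (h > 340)                    # assumed hue range for red
    green = (h > 90) & (h < 150)                  # assumed hue range for green
    mask = (red | green) & (s > sat_min) & (v > val_min)
    if not mask.any():
        return None
    score = np.where(mask, s * v, 0.0)            # colour intensity times brightness
    row, col = np.unravel_index(np.argmax(score), score.shape)
    return int(row), int(col)
```

Applying `detect_led` to every frame of the video yields a pixel trajectory of the LED, from which the source direction would then be derived using the camera geometry.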

Appendix D Experiments

D.1 Details for experimental configuration

(Section 4.2) All speech enhancement and sound source localization networks are trained on 16 kHz 4-second speech clips, and tested on 16 kHz speech utterances.

For the speech enhancement networks, the training configurations follow those of the original papers. Each model is trained for 50 epochs, and the best checkpoint is selected according to the DNSMOS score on the validation data.

For the sound source localization task, the window length of the STFT is 512 samples (32 ms) with a frame shift of 256 samples (16 ms). The model outputs a localization result every 6 frames. Adam is used as the optimizer for training. The batch sizes of the fixed-array model and the variable-array model are set to 16 and 4, respectively. The learning rate is initially set to 0.0005, and decays exponentially with a decay factor of 0.975. We train each model for 50 epochs, and the best checkpoint is selected according to the validation loss.
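
The timing and learning-rate figures above can be checked with a few lines of arithmetic. The sketch assumes that the decay factor is applied once per epoch and that localization outputs come from consecutive, non-overlapping blocks of 6 frames; both are assumptions, and the constant names are illustrative.

```python
FS = 16_000                      # sampling rate in Hz
WIN, HOP = 512, 256              # STFT window length and frame shift in samples
FRAMES_PER_OUTPUT = 6            # one localization result every 6 frames
LR0, DECAY, EPOCHS = 5e-4, 0.975, 50

win_ms = 1000 * WIN / FS                        # window length in ms
hop_ms = 1000 * HOP / FS                        # frame shift in ms
output_period_ms = FRAMES_PER_OUTPUT * hop_ms   # time between localization outputs

def learning_rate(epoch):
    """Learning rate at a given epoch, assuming per-epoch exponential decay."""
    return LR0 * DECAY ** epoch

schedule = [learning_rate(e) for e in range(EPOCHS)]
```

Under these assumptions the model produces one localization estimate every 96 ms.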

D.2 Analysis of the spatial coherence of real multi-channel noise

One of the major differences between simulated diffuse noise and real-world recorded noise lies in the spatial coherence [1]. Fig. 5 shows the spatial correlation coefficient of the noise signals recorded with a microphone pair spaced 6 cm apart in some representative scenes, together with that of the theoretical spherical (3D diffuse) noise field. It can be seen that the spatial correlation of real noise varies considerably from scene to scene. Some scenes, such as the ‘Market’, have a noise field that is fairly close to the 3D diffuse noise field. However, for many other scenes, the noise field is a complicated combination of directional noise sources and (partially) diffuse noise sources, and thus deviates largely from the 3D diffuse noise field. To examine the temporal stationarity of the spatial correlation of real noise, we plot the spatial correlation as a function of time in Fig. 6, where the spatial correlations are estimated from consecutive 1-second recordings. This figure shows that, even for the ‘Market’ scene, which has an overall 3D diffuse noise field, the spatial correlation is still highly time-varying.

(a) Market.
(b) LivingRoom.
(c) Park.
(d) OfficeLobby.
Figure 5: Spatial correlation coefficients of multi-channel noise in typical scenes.
Figure 6: The (absolute value of) spatial correlation as a function of time, at the frequency of 2 kHz in the ‘Market’ scene.
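
As a rough illustration of the quantities compared in Fig. 5, the sketch below computes the theoretical coherence of a spherically isotropic (3D diffuse) field for a microphone pair, $\mathrm{sinc}(2 f d / c)$, and a frame-averaged estimate of the complex coherence between two recorded channels. The speed of sound (343 m/s), the STFT settings, the Hann window and the function names are assumptions of this sketch, not values stated in the paper.

```python
import numpy as np

def diffuse_coherence(freqs, d=0.06, c=343.0):
    """Theoretical coherence of a 3D diffuse field for microphone spacing d (m)."""
    # np.sinc(x) = sin(pi x) / (pi x), so this equals sin(2 pi f d / c) / (2 pi f d / c).
    return np.sinc(2.0 * np.asarray(freqs) * d / c)

def spatial_coherence(x1, x2, fs, n_fft=512, hop=256):
    """Estimate the complex coherence between two channels by averaging over frames."""
    win = np.hanning(n_fft)
    n_frames = 1 + (len(x1) - n_fft) // hop
    idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
    X1 = np.fft.rfft(x1[idx] * win, axis=1)
    X2 = np.fft.rfft(x2[idx] * win, axis=1)
    p12 = np.mean(X1 * np.conj(X2), axis=0)       # cross-power spectrum
    p11 = np.mean(np.abs(X1) ** 2, axis=0)        # auto-power spectra
    p22 = np.mean(np.abs(X2) ** 2, axis=0)
    freqs = np.fft.rfftfreq(n_fft, 1.0 / fs)
    return freqs, p12 / np.sqrt(p11 * p22 + 1e-20)
```

Running `spatial_coherence` on consecutive 1-second blocks of a noise recording and tracking the magnitude at one frequency bin gives a time curve of the kind plotted in Fig. 6.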

D.3 Experimental results under high-SNR conditions

D.3.1 Benchmark experiments

In addition to the low-SNR conditions presented in Section 4.2, where the recorded speech and noise are added, this section presents the results for high-SNR conditions, where only the recorded speech (without adding extra noise) is tested. Tables 10 and 11 show the performance of speech enhancement and sound source localization, respectively. Under the high-SNR conditions, noise is no longer a crucial factor, and accordingly the networks trained with real noise and with simulated noise achieve comparable performance.

D.3.2 Variable-array networks and array generalization

Tables 12 and 13 show the results of the variable-array experiments for the high-SNR case. Similar conclusions can be drawn as for the low-SNR experiments.

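Table 10: Benchmark experiments of speech enhancement under high-SNR conditions. Columns prefixed with "Static" and "Moving" refer to the static-speaker and moving-speaker cases.

| Baseline | Training speech | Training noise | Static WB-PESQ | Static SI-SDR [dB] | Static MOS-SIG | Static MOS-BAK | Static MOS-OVR | Static CER [%] | Moving WB-PESQ | Moving SI-SDR [dB] | Moving MOS-SIG | Moving MOS-BAK | Moving MOS-OVR | Moving CER [%] |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| unprocessed | - | - | 1.35 | -4.7 | 2.58 | 2.36 | 1.99 | 10.1 | 1.21 | -4.9 | 2.28 | 2.00 | 1.70 | 10.5 |
| FaSNet-TAC [10] | sim | sim | 1.74 | -0.3 | 2.95 | 3.74 | 2.61 | 13.3 | 1.62 | -0.4 | 2.86 | 3.63 | 2.49 | 14.2 |
| FaSNet-TAC [10] | sim | real | 1.89 | -0.0 | 3.04 | 3.69 | 2.67 | 11.2 | 1.72 | 0.1 | 2.94 | 3.57 | 2.54 | 12.2 |
| FaSNet-TAC [10] | real | sim | 1.91 | 4.7 | 2.93 | 3.75 | 2.60 | 12.9 | 1.68 | 2.6 | 2.79 | 3.59 | 2.42 | 14.6 |
| FaSNet-TAC [10] | real | real | 1.97 | 4.6 | 2.99 | 3.75 | 2.65 | 12.0 | 1.73 | 2.5 | 2.87 | 3.62 | 2.50 | 12.9 |
| SpatialNet [28] | sim | sim | 1.97 | -2.7 | 3.39 | 3.30 | 2.74 | 9.1 | 1.76 | -2.7 | 3.35 | 3.16 | 2.65 | 9.8 |
| SpatialNet [28] | sim | real | 1.84 | -0.1 | 3.05 | 3.00 | 2.41 | 10.4 | 1.70 | -0.1 | 2.98 | 2.84 | 2.30 | 11.2 |
| SpatialNet [28] | real | sim | 2.74 | 7.8 | 3.24 | 3.66 | 2.82 | 8.9 | 2.39 | 5.3 | 3.15 | 3.53 | 2.68 | 9.8 |
| SpatialNet [28] | real | real | 2.71 | 8.0 | 3.22 | 3.78 | 2.87 | 9.0 | 2.35 | 5.1 | 3.13 | 3.71 | 2.75 | 10.4 |
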
Table 11: CRNN benchmark experiments of sound source localization under high-SNR conditions.

| Training speech | Training noise | Static ACC(5°) [%] | Static MAE [°] | Moving ACC(5°) [%] | Moving MAE [°] |
|---|---|---|---|---|---|
| sim | sim | 84.5 | 4.3 | 84.3 | 2.8 |
| sim | real | 87.8 | 4.3 | 85.4 | 3.8 |
| real | sim | 94.4 | 3.6 | 91.3 | 2.3 |
| real | real | 92.7 | 3.5 | 91.4 | 2.3 |

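Table 12: FaSNet-TAC variable-array experiments for speech enhancement under high-SNR conditions.

| Setting | Static WB-PESQ | Static SI-SDR [dB] | Static MOS-SIG | Static MOS-BAK | Static MOS-OVR | Static CER [%] | Moving WB-PESQ | Moving SI-SDR [dB] | Moving MOS-SIG | Moving MOS-BAK | Moving MOS-OVR | Moving CER [%] |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| unprocessed | 1.35 | -4.7 | 2.58 | 2.36 | 1.99 | 10.1 | 1.21 | -4.9 | 2.28 | 2.00 | 1.70 | 10.5 |
| Fixed-Array | 1.87 | 4.1 | 2.95 | 3.77 | 2.63 | 13.3 | 1.66 | 2.0 | 2.83 | 3.64 | 2.47 | 15.5 |
| Variable-Array | 1.73 | 3.0 | 2.84 | 3.68 | 2.50 | 15.5 | 1.58 | 1.6 | 2.73 | 3.54 | 2.35 | 18.1 |
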
Table 13: IPDnet variable-array experiments for sound source localization under high-SNR conditions.

| Setting | Static ACC(5°) [%] | Static MAE [°] | Moving ACC(5°) [%] | Moving MAE [°] |
|---|---|---|---|---|
| Fixed-Array [12] | 86.8 | 2.9 | 93.8 | 1.7 |
| Variable-Array [12] | 86.3 | 3.2 | 85.7 | 2.7 |