
Fast processing explains the effect of sound reflection on binaural unmasking

Authors: Norbert Kolotzek, Pierre G. Aublin, Bernhard U. Seeber

Abstract: Sound reflections and late reverberation alter energetic and binaural cues of a target source, thereby affecting its detection in noise. Two experiments investigated detection of harmonic complex tones, centered around 500 Hz, in noise in a virtual room with different modifications of simulated room impulse responses (RIR). Stimuli were auralized using the SOFE's loudspeakers in anechoic space. The target was presented from the front or at 0$^\circ$ azimuth, while an anechoic noise masker was simultaneously presented at 0$^\circ$. In the first experiment, early reflections were progressively added to the RIR and detection thresholds of the reverberant target were measured. For a frontal sound source, detection thresholds decreased while adding the first 45 ms of early reflections, whereas for a lateral sound source thresholds remained constant. In the second experiment, early reflections were cut out while late reflections were kept along with the direct sound. Results for a target at 0$^\circ$ show that even reflections as late as 150 ms reduce detection thresholds compared to only the direct sound. A binaural model with a sluggishness component following the computation of binaural unmasking in short windows predicts measured and literature results better than when large windows are used.

Submitted 30 June, 2021; originally announced June 2021.

Comments: Preprint from June 2nd, 2021

arXiv:2106.15916 [pdf]

Communication conditions in virtual acoustic scenes in an underground station

Authors: Ľuboš Hládek, Stephan D. Ewert, Bernhard U. Seeber

Abstract: Underground stations are a common communication situation in towns: we talk with friends or colleagues, listen to announcements or shop for titbits while background noise and reverberation challenge communication. Here, we perform an acoustical analysis of two communication scenes in an underground station in Munich and test speech intelligibility. The acoustical conditions were measured in the station and are compared to simulations in the real-time Simulated Open Field Environment (rtSOFE). We compare binaural room impulse responses measured with an artificial head in the station to modeled impulse responses for free-field auralization via 60 loudspeakers in the rtSOFE. We used the image source method to model early reflections and a set of multi-microphone recordings to model late reverberation. The first communication scene consists of 12 equidistant (1.6 m) horizontally spaced source positions around a listener, simulating different direction-dependent spatial unmasking conditions. The second scene mimics an approaching speaker across six radially spaced source positions (from 1 m to 10 m) with varying direct sound level and thus direct-to-reverberant energy. The acoustic parameters of the underground station show a moderate amount of reverberation (T30 in octave bands was between 2.3 s and 0.6 s and early-decay times between 1.46 s and 0.46 s). The binaural and energetic parameters of the auralization closely matched the measurement. Measured speech reception thresholds were within the error of the speech test, allowing us to conclude that the auralized simulation reproduces acoustic and perceptually relevant parameters for speech intelligibility with high accuracy.

Submitted 2 November, 2021; v1 submitted 30 June, 2021; originally announced June 2021.

Comments: I3DA conference paper, 8 figures, 9 pages

arXiv:2106.15909 [pdf]

doi 10.1109/I3DA48870.2021.9610916

Effect of acoustic scene complexity and visual scene representation on auditory perception in virtual audio-visual environments

Authors: Stefan Fichna, Thomas Biberger, Bernhard U. Seeber, Stephan D. Ewert

Abstract: In daily life, social interaction and acoustic communication often take place in complex acoustic environments (CAE) with a variety of interfering sounds and reverberation. For hearing research and the evaluation of hearing systems, simulated CAEs using virtual reality techniques have gained interest in the context of ecological validity. In the current study, the effect of scene complexity and visual representation of the scene on psychoacoustic measures like sound source location, distance perception, loudness, speech intelligibility, and listening effort in a virtual audio-visual environment was investigated. A 3-dimensional, 86-channel loudspeaker array was used to render the sound field in combination with or without a head-mounted display (HMD) to create an immersive stereoscopic visual representation of the scene. The scene consisted of a ring of eight (virtual) loudspeakers which played a target speech stimulus and nonsense speech interferers in several spatial conditions. Either an anechoic (snowy outdoor scenery) or echoic environment (loft apartment) with a reverberation time (T60) of about 1.5 s was simulated. In addition to varying the number of interferers, scene complexity was varied by assessing the psychoacoustic measures in isolated consecutive measurements or simultaneously. Results showed no significant effect of wearing the HMD on the data. Loudness and distance perception showed significantly different results when they were measured simultaneously instead of consecutively in isolation. The advantage of the suggested setup is that it can be directly transferred to a corresponding real room, enabling a 1:1 comparison and verification of the perception experiments in the real and virtual environment.

Submitted 7 November, 2021; v1 submitted 30 June, 2021; originally announced June 2021.

Comments: Accepted for publication in Proceedings of 3DA 2021 International Conference on Immersive and 3D Audio

arXiv:2007.12892 [pdf, ps, other]

MP3 Compression To Diminish Adversarial Noise in End-to-End Speech Recognition

Authors: Iustina Andronic, Ludwig Kürzinger, Edgar Ricardo Chavez Rosas, Gerhard Rigoll, Bernhard U. Seeber

Abstract: Audio Adversarial Examples (AAE) represent specially created inputs meant to trick Automatic Speech Recognition (ASR) systems into misclassification. The present work proposes MP3 compression as a means to decrease the impact of Adversarial Noise (AN) in audio samples transcribed by ASR systems. To this end, we generated AAEs with the Fast Gradient Sign Method for an end-to-end, hybrid CTC-attention ASR system. Our method is then validated by two objective indicators: (1) Character Error Rates (CER) that measure the speech decoding performance of four ASR models trained on uncompressed, as well as MP3-compressed data sets and (2) Signal-to-Noise Ratio (SNR) estimated for both uncompressed and MP3-compressed AAEs that are reconstructed in the time domain by feature inversion. We found that MP3 compression applied to AAEs indeed reduces the CER when compared to uncompressed AAEs. Moreover, feature-inverted (reconstructed) AAEs had significantly higher SNRs after MP3 compression, indicating that AN was reduced. In contrast to AN, MP3 compression applied to utterances augmented with regular noise resulted in more transcription errors, giving further evidence that MP3 encoding is effective in diminishing only AN.

Submitted 25 July, 2020; originally announced July 2020.

Comments: Submitted and accepted at SPECOM 2020 conference

Showing 1–4 of 4 results for author: Seeber, B U