
Two vs. Four-Channel Sound Event Localization and Detection

Authors: Julia Wilkins, Magdalena Fuentes, Luca Bondi, Shabnam Ghaffarzadegan, Ali Abavisani, Juan Pablo Bello

Abstract: Sound event localization and detection (SELD) systems estimate both the direction-of-arrival (DOA) and class of sound sources over time. In the DCASE 2022 SELD Challenge (Task 3), models are designed to operate in a 4-channel setting. While beneficial to further the development of SELD systems using a multichannel recording setup such as first-order Ambisonics (FOA), most consumer electronics devices are rarely able to record using more than two channels. For this reason, in this work we investigate the performance of the DCASE 2022 SELD baseline model using three audio input representations: FOA, binaural, and stereo. We perform a novel comparative analysis illustrating the effect of these audio input representations on SELD performance. Crucially, we show that binaural and stereo (i.e., 2-channel) audio-based SELD models are still able to localize and detect sound sources laterally quite well, despite overall performance degrading as less audio information is provided. Further, we segment our analysis by scenes containing varying degrees of sound source polyphony to better understand the effect of audio input representation on localization and detection performance as scene conditions become increasingly complex.

Submitted 23 September, 2023; originally announced September 2023.

arXiv:2201.11207

Discovering Phonetic Inventories with Crosslingual Automatic Speech Recognition

Authors: Piotr Żelasko, Siyuan Feng, Laureano Moro Velazquez, Ali Abavisani, Saurabhchand Bhati, Odette Scharenborg, Mark Hasegawa-Johnson, Najim Dehak

Abstract: The high cost of data acquisition makes Automatic Speech Recognition (ASR) model training problematic for most existing languages, including languages that do not even have a written script, or for which the phone inventories remain unknown. Past works explored multilingual training, transfer learning, as well as zero-shot learning in order to build ASR systems for these low-resource languages. While it has been shown that the pooling of resources from multiple languages is helpful, we have not yet seen a successful application of an ASR model to a language unseen during training. A crucial step in the adaptation of ASR from seen to unseen languages is the creation of the phone inventory of the unseen language. The ultimate goal of our work is to build the phone inventory of a language unseen during training in an unsupervised way without any knowledge about the language. In this paper, we 1) investigate the influence of different factors (i.e., model architecture, phonotactic model, type of speech representation) on phone recognition in an unknown language; 2) provide an analysis of which phones transfer well across languages and which do not in order to understand the limitations of and areas for further improvement for automatic phone inventory creation; and 3) present different methods to build a phone inventory of an unseen language in an unsupervised way. To that end, we conducted mono-, multi-, and crosslingual experiments on a set of 13 phonetically diverse languages and several in-depth analyses. We found a number of universal phone tokens (IPA symbols) that are well-recognized cross-linguistically. Through a detailed analysis of results, we conclude that unique sounds, similar sounds, and tone languages remain a major challenge for phonetic inventory discovery.

Submitted 27 January, 2022; v1 submitted 26 January, 2022; originally announced January 2022.

Comments: Accepted for publication in Computer Speech and Language

arXiv:2010.12104

doi 10.1109/ICASSP39728.2021.9414478

How Phonotactics Affect Multilingual and Zero-shot ASR Performance

Authors: Siyuan Feng, Piotr Żelasko, Laureano Moro-Velázquez, Ali Abavisani, Mark Hasegawa-Johnson, Odette Scharenborg, Najim Dehak

Abstract: The idea of combining multiple languages' recordings to train a single automatic speech recognition (ASR) model brings the promise of the emergence of universal speech representation. Recently, a Transformer encoder-decoder model has been shown to leverage multilingual data well in IPA transcriptions of languages presented during training. However, the representations it learned were not successful in zero-shot transfer to unseen languages. Because that model lacks an explicit factorization of the acoustic model (AM) and language model (LM), it is unclear to what degree the performance suffered from differences in pronunciation or the mismatch in phonotactics. To gain more insight into the factors limiting zero-shot ASR transfer, we replace the encoder-decoder with a hybrid ASR system consisting of a separate AM and LM. Then, we perform an extensive evaluation of monolingual, multilingual, and crosslingual (zero-shot) acoustic and language models on a set of 13 phonetically diverse languages. We show that the gain from modeling crosslingual phonotactics is limited, and imposing too strong a model can hurt the zero-shot transfer. Furthermore, we find that a multilingual LM hurts a multilingual ASR system's performance, and retaining only the target language's phonotactic data in LM training is preferable.

Submitted 10 February, 2021; v1 submitted 22 October, 2020; originally announced October 2020.

Comments: Accepted for publication in IEEE ICASSP 2021. The first 2 authors contributed equally to this work

arXiv:2005.06065

doi 10.21437/Interspeech.2020-2121

Automatic Estimation of Intelligibility Measure for Consonants in Speech

Authors: Ali Abavisani, Mark Hasegawa-Johnson

Abstract: In this article, we provide a model to estimate a real-valued measure of the intelligibility of individual speech segments. We trained regression models based on Convolutional Neural Networks (CNN) for stop consonants /p, t, k, b, d, g/ associated with vowel /ɑ/, to estimate the corresponding Signal to Noise Ratio (SNR) at which the Consonant-Vowel (CV) sound becomes intelligible for Normal Hearing (NH) ears. The intelligibility measure for each sound is called SNR$_{90}$, and is defined to be the SNR level at which human participants are able to recognize the consonant at least 90% correctly, on average, as determined in prior experiments with NH subjects. Performance of the CNN is compared to a baseline prediction based on automatic speech recognition (ASR), specifically, a constant offset subtracted from the SNR at which the ASR becomes capable of correctly labeling the consonant. Compared to baseline, our models were able to accurately estimate the SNR$_{90}$ intelligibility measure with less than 2 [dB$^2$] Mean Squared Error (MSE) on average, while the baseline ASR-defined measure computes SNR$_{90}$ with a variance of 5.2 to 26.6 [dB$^2$], depending on the consonant.

Submitted 28 June, 2020; v1 submitted 12 May, 2020; originally announced May 2020.

Comments: 5 pages, 1 figure, 7 tables, submitted to the Interspeech 2020 conference

arXiv:1908.04751

doi 10.1121/1.5101104

The role of cue enhancement and frequency fine-tuning in hearing impaired phone recognition

Authors: Ali Abavisani, Mark A Hasegawa-Johnson

Abstract: A speech-based hearing test is designed to identify the susceptible, error-prone phones for each individual hearing-impaired (HI) ear. Only tokens that are robust at the experimental noise levels were chosen for the test. The noise-robustness of a token is measured as its SNR90, which is the signal to speech-weighted noise ratio at which a normal hearing (NH) listener would recognize the token with an accuracy of 90% on average. Two sets of tokens, T1 and T2, having the same consonant-vowels but different talkers with distinct SNR90, were presented with flat gain at listeners' most comfortable level. We studied the effects of frequency fine-tuning of the primary cue by presenting tokens of the same consonant but different vowels with similar SNR90. Additionally, we investigated the role of changing the intensity of the primary cue in HI phone recognition by presenting tokens from both sets T1 and T2. On average, 92% of tokens are improved when we replaced the CV with the same CV but with a more robust talker. Additionally, using CVs with similar SNR90, on average, tokens are improved by 75%, 71%, 63%, and 72%, when we replaced vowels /A, ae, I, E/, respectively. The confusion pattern in each case provides insight into how these changes affect the phone recognition in each HI ear. We propose to prescribe hearing aid amplification tailored to individual HI ears, based on the confusion pattern, the response from cue enhancement, and the response from frequency fine-tuning of the cue.

Submitted 9 August, 2019; originally announced August 2019.

Comments: 16 pages, 10 figures, proceedings of the Acoustical Society of America meeting, May 2019, Louisville, KY

Showing 1–5 of 5 results for author: Abavisani, A