Showing 1–2 of 2 results for author: Fuhs, M C

Search v0.5.6 released 2020-02-24

arXiv:2406.17124 [pdf, other]

cs.SD cs.LG eess.AS

Investigating Confidence Estimation Measures for Speaker Diarization

Authors: Anurag Chowdhury, Abhinav Misra, Mark C. Fuhs, Monika Woszczyna

Abstract: Speaker diarization systems segment a conversation recording based on the speakers' identity. Such systems can misclassify the speaker of a portion of audio due to a variety of factors, such as speech pattern variation, background noise, and overlap** speech. These errors propagate to, and can adversely affect, downstream systems that rely on the speaker's identity, such as speaker-adapted speec… ▽ More Speaker diarization systems segment a conversation recording based on the speakers' identity. Such systems can misclassify the speaker of a portion of audio due to a variety of factors, such as speech pattern variation, background noise, and overlap** speech. These errors propagate to, and can adversely affect, downstream systems that rely on the speaker's identity, such as speaker-adapted speech recognition. One of the ways to mitigate these errors is to provide segment-level diarization confidence scores to downstream systems. In this work, we investigate multiple methods for generating diarization confidence scores, including those derived from the original diarization system and those derived from an external model. Our experiments across multiple datasets and diarization systems demonstrate that the most competitive confidence score methods can isolate ~30% of the diarization errors within segments with the lowest ~10% of confidence scores. △ Less

Submitted 24 June, 2024; originally announced June 2024.

Comments: Accepted in INTERSPEECH 2024
arXiv:2406.08914 [pdf, other]

cs.SD cs.LG eess.AS

Transcription-Free Fine-Tuning of Speech Separation Models for Noisy and Reverberant Multi-Speaker Automatic Speech Recognition

Authors: William Ravenscroft, George Close, Stefan Goetze, Thomas Hain, Mohammad Soleymanpour, Anurag Chowdhury, Mark C. Fuhs

Abstract: One solution to automatic speech recognition (ASR) of overlap** speakers is to separate speech and then perform ASR on the separated signals. Commonly, the separator produces artefacts which often degrade ASR performance. Addressing this issue typically requires reference transcriptions to jointly train the separation and ASR networks. This is often not viable for training on real-world in-domai… ▽ More One solution to automatic speech recognition (ASR) of overlap** speakers is to separate speech and then perform ASR on the separated signals. Commonly, the separator produces artefacts which often degrade ASR performance. Addressing this issue typically requires reference transcriptions to jointly train the separation and ASR networks. This is often not viable for training on real-world in-domain audio where reference transcript information is not always available. This paper proposes a transcription-free method for joint training using only audio signals. The proposed method uses embedding differences of pre-trained ASR encoders as a loss with a proposed modification to permutation invariant training (PIT) called guided PIT (GPIT). The method achieves a 6.4% improvement in word error rate (WER) measures over a signal-level loss and also shows enhancement improvements in perceptual measures such as short-time objective intelligibility (STOI). △ Less

Submitted 13 June, 2024; originally announced June 2024.

Comments: 5 pages, 3 Figures, 3 Tables, Accepted for Interspeech 2024

Search v0.5.6 released 2020-02-24