-
A Generalized Bandsplit Neural Network for Cinematic Audio Source Separation
Authors:
Karn N. Watcharasupat,
Chih-Wei Wu,
Yiwei Ding,
Iroro Orife,
Aaron J. Hipple,
Phillip A. Williams,
Scott Kramer,
Alexander Lerch,
William Wolcott
Abstract:
Cinematic audio source separation is a relatively new subtask of audio source separation, with the aim of extracting the dialogue, music, and effects stems from their mixture. In this work, we developed a model generalizing the Bandsplit RNN for any complete or overcomplete partitions of the frequency axis. Psychoacoustically motivated frequency scales were used to inform the band definitions whic…
▽ More
Cinematic audio source separation is a relatively new subtask of audio source separation, with the aim of extracting the dialogue, music, and effects stems from their mixture. In this work, we developed a model generalizing the Bandsplit RNN for any complete or overcomplete partitions of the frequency axis. Psychoacoustically motivated frequency scales were used to inform the band definitions which are now defined with redundancy for more reliable feature extraction. A loss function motivated by the signal-to-noise ratio and the sparsity-promoting property of the 1-norm was proposed. We additionally exploit the information-sharing property of a common-encoder setup to reduce computational complexity during both training and inference, improve separation performance for hard-to-generalize classes of sounds, and allow flexibility during inference time with detachable decoders. Our best model sets the state of the art on the Divide and Remaster dataset with performance above the ideal ratio mask for the dialogue stem.
△ Less
Submitted 1 December, 2023; v1 submitted 5 September, 2023;
originally announced September 2023.
-
ÌròyìnSpeech: A multi-purpose Yorùbá Speech Corpus
Authors:
Tolulope Ogunremi,
Kola Tubosun,
Anuoluwapo Aremu,
Iroro Orife,
David Ifeoluwa Adelani
Abstract:
We introduce ÌròyìnSpeech, a new corpus influenced by the desire to increase the amount of high quality, contemporary Yorùbá speech data, which can be used for both Text-to-Speech (TTS) and Automatic Speech Recognition (ASR) tasks. We curated about 23000 text sentences from news and creative writing domains with the open license CC-BY-4.0. To encourage a participatory approach to data creation, we…
▽ More
We introduce ÌròyìnSpeech, a new corpus influenced by the desire to increase the amount of high quality, contemporary Yorùbá speech data, which can be used for both Text-to-Speech (TTS) and Automatic Speech Recognition (ASR) tasks. We curated about 23000 text sentences from news and creative writing domains with the open license CC-BY-4.0. To encourage a participatory approach to data creation, we provide 5000 curated sentences to the Mozilla Common Voice platform to crowd-source the recording and validation of Yorùbá speech data. In total, we created about 42 hours of speech data recorded by 80 volunteers in-house, and 6 hours of validated recordings on Mozilla Common Voice platform. Our TTS evaluation suggests that a high-fidelity, general domain, single-speaker Yorùbá voice is possible with as little as 5 hours of speech. Similarly, for ASR we obtained a baseline word error rate (WER) of 23.8.
△ Less
Submitted 27 March, 2024; v1 submitted 29 July, 2023;
originally announced July 2023.
-
Looking Similar, Sounding Different: Leveraging Counterfactual Cross-Modal Pairs for Audiovisual Representation Learning
Authors:
Nikhil Singh,
Chih-Wei Wu,
Iroro Orife,
Mahdi Kalayeh
Abstract:
Audiovisual representation learning typically relies on the correspondence between sight and sound. However, there are often multiple audio tracks that can correspond with a visual scene. Consider, for example, different conversations on the same crowded street. The effect of such counterfactual pairs on audiovisual representation learning has not been previously explored. To investigate this, we…
▽ More
Audiovisual representation learning typically relies on the correspondence between sight and sound. However, there are often multiple audio tracks that can correspond with a visual scene. Consider, for example, different conversations on the same crowded street. The effect of such counterfactual pairs on audiovisual representation learning has not been previously explored. To investigate this, we use dubbed versions of movies and television shows to augment cross-modal contrastive learning. Our approach learns to represent alternate audio tracks, differing only in speech, similarly to the same video. Our results, from a comprehensive set of experiments investigating different training strategies, show this general approach improves performance on a range of downstream auditory and audiovisual tasks, without majorly affecting linguistic task performance overall. These findings highlight the importance of considering speech variation when learning scene-level audiovisual correspondences and suggest that dubbed audio can be a useful augmentation technique for training audiovisual models toward more robust performance on diverse downstream tasks.
△ Less
Submitted 8 June, 2024; v1 submitted 12 April, 2023;
originally announced April 2023.
-
BibleTTS: a large, high-fidelity, multilingual, and uniquely African speech corpus
Authors:
Josh Meyer,
David Ifeoluwa Adelani,
Edresson Casanova,
Alp Öktem,
Daniel Whitenack Julian Weber,
Salomon Kabongo,
Elizabeth Salesky,
Iroro Orife,
Colin Leong,
Perez Ogayo,
Chris Emezue,
Jonathan Mukiibi,
Salomey Osei,
Apelete Agbolo,
Victor Akinode,
Bernard Opoku,
Samuel Olanrewaju,
Jesujoba Alabi,
Shamsuddeen Muhammad
Abstract:
BibleTTS is a large, high-quality, open speech dataset for ten languages spoken in Sub-Saharan Africa. The corpus contains up to 86 hours of aligned, studio quality 48kHz single speaker recordings per language, enabling the development of high-quality text-to-speech models. The ten languages represented are: Akuapem Twi, Asante Twi, Chichewa, Ewe, Hausa, Kikuyu, Lingala, Luganda, Luo, and Yoruba.…
▽ More
BibleTTS is a large, high-quality, open speech dataset for ten languages spoken in Sub-Saharan Africa. The corpus contains up to 86 hours of aligned, studio quality 48kHz single speaker recordings per language, enabling the development of high-quality text-to-speech models. The ten languages represented are: Akuapem Twi, Asante Twi, Chichewa, Ewe, Hausa, Kikuyu, Lingala, Luganda, Luo, and Yoruba. This corpus is a derivative work of Bible recordings made and released by the Open.Bible project from Biblica. We have aligned, cleaned, and filtered the original recordings, and additionally hand-checked a subset of the alignments for each language. We present results for text-to-speech models with Coqui TTS. The data is released under a commercial-friendly CC-BY-SA license.
△ Less
Submitted 7 July, 2022;
originally announced July 2022.
-
Learning Nigerian accent embeddings from speech: preliminary results based on SautiDB-Naija corpus
Authors:
Tejumade Afonja,
Oladimeji Mudele,
Iroro Orife,
Kenechi Dukor,
Lawrence Francis,
Duru Goodness,
Oluwafemi Azeez,
Ademola Malomo,
Clinton Mbataku
Abstract:
This paper describes foundational efforts with SautiDB-Naija, a novel corpus of non-native (L2) Nigerian English speech. We describe how the corpus was created and curated as well as preliminary experiments with accent classification and learning Nigerian accent embeddings. The initial version of the corpus includes over 900 recordings from L2 English speakers of Nigerian languages, such as Yoruba…
▽ More
This paper describes foundational efforts with SautiDB-Naija, a novel corpus of non-native (L2) Nigerian English speech. We describe how the corpus was created and curated as well as preliminary experiments with accent classification and learning Nigerian accent embeddings. The initial version of the corpus includes over 900 recordings from L2 English speakers of Nigerian languages, such as Yoruba, Igbo, Edo, Efik-Ibibio, and Igala. We further demonstrate how fine-tuning on a pre-trained model like wav2vec can yield representations suitable for related speech tasks such as accent classification. SautiDB-Naija has been published to Zenodo for general use under a flexible Creative Commons License.
△ Less
Submitted 12 December, 2021;
originally announced December 2021.
-
AVASpeech-SMAD: A Strongly Labelled Speech and Music Activity Detection Dataset with Label Co-Occurrence
Authors:
Yun-Ning Hung,
Karn N. Watcharasupat,
Chih-Wei Wu,
Iroro Orife,
Kelian Li,
Pavan Seshadri,
Junyoung Lee
Abstract:
We propose a dataset, AVASpeech-SMAD, to assist speech and music activity detection research. With frame-level music labels, the proposed dataset extends the existing AVASpeech dataset, which originally consists of 45 hours of audio and speech activity labels. To the best of our knowledge, the proposed AVASpeech-SMAD is the first open-source dataset that features strong polyphonic labels for both…
▽ More
We propose a dataset, AVASpeech-SMAD, to assist speech and music activity detection research. With frame-level music labels, the proposed dataset extends the existing AVASpeech dataset, which originally consists of 45 hours of audio and speech activity labels. To the best of our knowledge, the proposed AVASpeech-SMAD is the first open-source dataset that features strong polyphonic labels for both music and speech. The dataset was manually annotated and verified via an iterative cross-checking process. A simple automatic examination was also implemented to further improve the quality of the labels. Evaluation results from two state-of-the-art SMAD systems are also provided as a benchmark for future reference.
△ Less
Submitted 1 November, 2021;
originally announced November 2021.
-
Audio Spectrogram Factorization for Classification of Telephony Signals below the Auditory Threshold
Authors:
Iroro Orife,
Shane Walker,
Jason Flaks
Abstract:
Traffic Pum** attacks are a form of high-volume SPAM that target telephone networks, defraud customers and squander telephony resources. One type of call in these attacks is characterized by very low-amplitude signal levels, notably below the auditory threshold. We propose a technique to classify so-called "dead air" or "silent" SPAM calls based on features derived from factorizing the caller au…
▽ More
Traffic Pum** attacks are a form of high-volume SPAM that target telephone networks, defraud customers and squander telephony resources. One type of call in these attacks is characterized by very low-amplitude signal levels, notably below the auditory threshold. We propose a technique to classify so-called "dead air" or "silent" SPAM calls based on features derived from factorizing the caller audio spectrogram. We describe the algorithms for feature extraction and classification as well as our data collection methods and production performance on millions of calls per week.
△ Less
Submitted 9 November, 2018;
originally announced November 2018.