-
Performant ASR Models for Medical Entities in Accented Speech
Authors:
Tejumade Afonja,
Tobi Olatunji,
Sewade Ogun,
Naome A. Etori,
Abraham Owodunni,
Moshood Yekini
Abstract:
Recent strides in automatic speech recognition (ASR) have accelerated its application in the medical domain, where its performance on accented medical named entities (NE), such as drug names, diagnoses, and lab results, is largely unknown. We rigorously evaluate multiple ASR models on a clinical English dataset of 93 African accents. Our analysis reveals that despite some models achieving low overall word error rates (WER), errors in clinical entities are higher, potentially posing substantial risks to patient safety. To empirically demonstrate this, we extract clinical entities from transcripts, develop a novel algorithm to align ASR predictions with these entities, and compute medical NE Recall, medical WER, and character error rate. Our results show that fine-tuning on accented clinical speech improves medical WER by a wide margin (25-34% relative), improving the models' practical applicability in healthcare environments.
Submitted 18 June, 2024;
originally announced June 2024.
-
1000 African Voices: Advancing inclusive multi-speaker multi-accent speech synthesis
Authors:
Sewade Ogun,
Abraham T. Owodunni,
Tobi Olatunji,
Eniola Alese,
Babatunde Oladimeji,
Tejumade Afonja,
Kayode Olaleye,
Naome A. Etori,
Tosin Adewumi
Abstract:
Recent advances in speech synthesis have enabled many useful applications like audio directions in Google Maps, screen readers, and automated content generation on platforms like TikTok. However, these systems are mostly dominated by voices sourced from data-rich geographies with personas representative of their source data. Although 3000 of the world's languages are domiciled in Africa, African voices and personas are under-represented in these systems. As speech synthesis becomes increasingly democratized, it is desirable to increase the representation of African English accents. We present Afro-TTS, the first pan-African accented English speech synthesis system able to generate speech in 86 African accents, with 1000 personas representing the rich phonological diversity across the continent for downstream application in Education, Public Health, and Automated Content Creation. Speaker interpolation retains naturalness and accentedness, enabling the creation of new voices.
Submitted 27 June, 2024; v1 submitted 17 June, 2024;
originally announced June 2024.
-
You are what you eat? Feeding foundation models a regionally diverse food dataset of World Wide Dishes
Authors:
Jabez Magomere,
Shu Ishida,
Tejumade Afonja,
Aya Salama,
Daniel Kochin,
Foutse Yuehgoh,
Imane Hamzaoui,
Raesetje Sefala,
Aisha Alaagib,
Elizaveta Semenova,
Lauren Crais,
Siobhan Mackenzie Hall
Abstract:
Foundation models are increasingly ubiquitous in our daily lives, used in everyday tasks such as text-image searches, interactions with chatbots, and content generation. As use increases, so does concern over the disparities in performance and fairness of these models for different people in different parts of the world. To assess these growing regional disparities, we present World Wide Dishes, a mixed text and image dataset consisting of 765 dishes, with dish names collected in 131 local languages. World Wide Dishes has been collected purely through human contribution and decentralised means, by creating a website widely distributed through social networks. Using the dataset, we demonstrate a novel means of operationalising capability and representational biases in foundation models such as language models and text-to-image generative models. We enrich these studies with a pilot community review to understand, from a first-person perspective, how these models generate images for people in five African countries and the United States.
We find that these models generally do not produce quality text and image outputs of dishes specific to different regions. This is true even for the US, which is typically considered to be more well-resourced in training data - though the generation of US dishes does outperform that of the investigated African countries. The models demonstrate a propensity to produce outputs that are inaccurate as well as culturally misrepresentative, flattening, and insensitive. These failures in capability and representational bias have the potential to further reinforce stereotypes and disproportionately contribute to erasure based on region. The dataset and code are available at https://github.com/oxai/world-wide-dishes/.
Submitted 13 June, 2024;
originally announced June 2024.
-
Towards Biologically Plausible and Private Gene Expression Data Generation
Authors:
Dingfan Chen,
Marie Oestreich,
Tejumade Afonja,
Raouf Kerkouche,
Matthias Becker,
Mario Fritz
Abstract:
Generative models trained with Differential Privacy (DP) are becoming increasingly prominent in the creation of synthetic data for downstream applications. Existing literature, however, primarily focuses on basic benchmarking datasets and tends to report promising results only for elementary metrics and relatively simple data distributions. In this paper, we initiate a systematic analysis of how DP generative models perform in their natural application scenarios, specifically focusing on real-world gene expression data. We conduct a comprehensive analysis of five representative DP generation methods, examining them from various angles, such as downstream utility, statistical properties, and biological plausibility. Our extensive evaluation illuminates the unique characteristics of each DP generation method, offering critical insights into the strengths and weaknesses of each approach, and uncovering intriguing possibilities for future developments. Perhaps surprisingly, our analysis reveals that most methods are capable of achieving seemingly reasonable downstream utility, according to the standard evaluation metrics considered in existing literature. Nevertheless, we find that none of the DP methods are able to accurately capture the biological characteristics of the real dataset. This observation suggests a potential over-optimistic assessment of current methodologies in this field and underscores a pressing need for future enhancements in model design.
Submitted 7 February, 2024;
originally announced February 2024.
-
AfriSpeech-200: Pan-African Accented Speech Dataset for Clinical and General Domain ASR
Authors:
Tobi Olatunji,
Tejumade Afonja,
Aditya Yadavalli,
Chris Chinenye Emezue,
Sahib Singh,
Bonaventure F. P. Dossou,
Joanne Osuchukwu,
Salomey Osei,
Atnafu Lambebo Tonja,
Naome Etori,
Clinton Mbataku
Abstract:
Africa has a very low doctor-to-patient ratio. At very busy clinics, doctors could see 30+ patients per day -- a heavy patient burden compared with developed countries -- but productivity tools such as clinical automatic speech recognition (ASR) are lacking for these overworked clinicians. However, clinical ASR is mature, even ubiquitous, in developed nations, and clinician-reported performance of commercial clinical ASR systems is generally satisfactory. Furthermore, the recent performance of general domain ASR is approaching human accuracy. However, several gaps exist. Several publications have highlighted racial bias with speech-to-text algorithms and performance on minority accents lags significantly. To our knowledge, there is no publicly available research or benchmark on accented African clinical ASR, and speech data is non-existent for the majority of African accents. We release AfriSpeech, 200hrs of Pan-African English speech, 67,577 clips from 2,463 unique speakers across 120 indigenous accents from 13 countries for clinical and general domain ASR, a benchmark test set, with publicly available pre-trained models with SOTA performance on the AfriSpeech benchmark.
Submitted 30 September, 2023;
originally announced October 2023.
-
MargCTGAN: A "Marginally" Better CTGAN for the Low Sample Regime
Authors:
Tejumade Afonja,
Dingfan Chen,
Mario Fritz
Abstract:
The potential of realistic and useful synthetic data is significant. However, current evaluation methods for synthetic tabular data generation predominantly focus on downstream task usefulness, often neglecting the importance of statistical properties. This oversight becomes particularly prominent in low sample scenarios, accompanied by a swift deterioration of these statistical measures. In this paper, we address this issue by conducting an evaluation of three state-of-the-art synthetic tabular data generators based on their marginal distribution, column-pair correlation, joint distribution and downstream task utility performance across high to low sample regimes. The popular CTGAN model shows strong utility, but underperforms in low sample settings in terms of utility. To overcome this limitation, we propose MargCTGAN that adds feature matching of de-correlated marginals, which results in a consistent improvement in downstream utility as well as statistical properties of the synthetic data.
Submitted 16 July, 2023;
originally announced July 2023.
-
AfriNames: Most ASR models "butcher" African Names
Authors:
Tobi Olatunji,
Tejumade Afonja,
Bonaventure F. P. Dossou,
Atnafu Lambebo Tonja,
Chris Chinenye Emezue,
Amina Mardiyyah Rufai,
Sahib Singh
Abstract:
Useful conversational agents must accurately capture named entities to minimize error for downstream tasks, for example, asking a voice assistant to play a track from a certain artist, initiating navigation to a specific location, or documenting a laboratory result for a patient. However, where named entities such as "Ukachukwu" (Igbo), "Lakicia" (Swahili), or "Ingabire" (Rwandan) are spoken, automatic speech recognition (ASR) models' performance degrades significantly, propagating errors to downstream systems. We model this problem as a distribution shift and demonstrate that such model bias can be mitigated through multilingual pre-training, intelligent data augmentation strategies to increase the representation of African-named entities, and fine-tuning multilingual ASR models on multiple African accents. The resulting fine-tuned models show an 81.5% relative WER improvement compared with the baseline on samples with African-named entities.
Submitted 2 June, 2023; v1 submitted 31 May, 2023;
originally announced June 2023.
-
Proceedings of the NeurIPS 2021 Workshop on Machine Learning for the Developing World: Global Challenges
Authors:
Paula Rodriguez Diaz,
Tejumade Afonja,
Konstantin Klemmer,
Aya Salama,
Niveditha Kalavakonda,
Oluwafemi Azeez,
Simone Fobi
Abstract:
These are the proceedings of the 5th workshop on Machine Learning for the Developing World (ML4D), held as part of the Thirty-fifth Conference on Neural Information Processing Systems (NeurIPS) on December 14th, 2021.
Submitted 10 January, 2023;
originally announced January 2023.
-
Generative Extraction of Audio Classifiers for Speaker Identification
Authors:
Tejumade Afonja,
Lucas Bourtoule,
Varun Chandrasekaran,
Sageev Oore,
Nicolas Papernot
Abstract:
It is perhaps no longer surprising that machine learning models, especially deep neural networks, are particularly vulnerable to attacks. One such vulnerability that has been well studied is model extraction: a phenomenon in which the attacker attempts to steal a victim's model by training a surrogate model to mimic the decision boundaries of the victim model. Previous works have demonstrated the effectiveness of such an attack and its devastating consequences, but much of this work has been done primarily for image and text processing tasks. Our work is the first attempt to perform model extraction on audio classification models. We are motivated by an attacker whose goal is to mimic the behavior of the victim's model trained to identify a speaker. This is particularly problematic in security-sensitive domains such as biometric authentication. We find that prior model extraction techniques, where the attacker naively uses a proxy dataset to attack a potential victim's model, fail. We therefore propose the use of a generative model to create a sufficiently large and diverse pool of synthetic attack queries. We find that our approach is able to extract a victim's model trained on LibriSpeech using queries synthesized with a proxy dataset based off of VoxCeleb; we achieve a test accuracy of 84.41% with a budget of 3 million queries.
Submitted 26 July, 2022;
originally announced July 2022.
-
Learning Nigerian accent embeddings from speech: preliminary results based on SautiDB-Naija corpus
Authors:
Tejumade Afonja,
Oladimeji Mudele,
Iroro Orife,
Kenechi Dukor,
Lawrence Francis,
Duru Goodness,
Oluwafemi Azeez,
Ademola Malomo,
Clinton Mbataku
Abstract:
This paper describes foundational efforts with SautiDB-Naija, a novel corpus of non-native (L2) Nigerian English speech. We describe how the corpus was created and curated as well as preliminary experiments with accent classification and learning Nigerian accent embeddings. The initial version of the corpus includes over 900 recordings from L2 English speakers of Nigerian languages, such as Yoruba, Igbo, Edo, Efik-Ibibio, and Igala. We further demonstrate how fine-tuning on a pre-trained model like wav2vec can yield representations suitable for related speech tasks such as accent classification. SautiDB-Naija has been published to Zenodo for general use under a flexible Creative Commons License.
Submitted 12 December, 2021;
originally announced December 2021.
-
Proceedings of the NeurIPS 2020 Workshop on Machine Learning for the Developing World: Improving Resilience
Authors:
Tejumade Afonja,
Konstantin Klemmer,
Aya Salama,
Paula Rodriguez Diaz,
Niveditha Kalavakonda,
Oluwafemi Azeez
Abstract:
These are the proceedings of the 4th workshop on Machine Learning for the Developing World (ML4D), held as part of the Thirty-fourth Conference on Neural Information Processing Systems (NeurIPS) on Saturday, December 12th 2020.
Submitted 12 January, 2021;
originally announced January 2021.
-
Proceedings of NeurIPS 2019 Workshop on Machine Learning for the Developing World: Challenges and Risks of ML4D
Authors:
Maria De-Arteaga,
Tejumade Afonja,
Amanda Coston
Abstract:
These are the proceedings of the 3rd ML4D workshop, which was held in Vancouver, Canada on December 13, 2019 as part of the Neural Information Processing Systems conference.
Submitted 10 April, 2020; v1 submitted 1 January, 2020;
originally announced January 2020.