arXiv:2308.15710
Speech Wikimedia: A 77 Language Multilingual Speech Dataset
Abstract: The Speech Wikimedia Dataset is a publicly available compilation of audio with transcriptions extracted from Wikimedia Commons. It includes 1780 hours (195 GB) of CC-BY-SA licensed transcribed speech from a diverse set of scenarios and speakers, in 77 different languages. Each audio file has one or more transcriptions in different languages, making this dataset suitable for training speech recogni…
Submitted 29 August, 2023; originally announced August 2023.
Comments: Data-Centric Machine Learning Workshop at the International Conference on Machine Learning 2023 (ICML)