Showing 1–1 of 1 results for author: Eusse, J

Search v0.5.6 released 2020-02-24

arXiv:2308.15710

cs.AI cs.LG

Speech Wikimedia: A 77 Language Multilingual Speech Dataset

Authors: Rafael Mosquera Gómez, Julián Eusse, Juan Ciro, Daniel Galvez, Ryan Hileman, Kurt Bollacker, David Kanter

Abstract: The Speech Wikimedia Dataset is a publicly available compilation of audio with transcriptions extracted from Wikimedia Commons. It includes 1780 hours (195 GB) of CC-BY-SA licensed transcribed speech from a diverse set of scenarios and speakers, in 77 different languages. Each audio file has one or more transcriptions in different languages, making this dataset suitable for training speech recognition, speech translation, and machine translation models.

Submitted 29 August, 2023; originally announced August 2023.

Comments: Data-Centric Machine Learning Workshop at the International Conference on Machine Learning 2023 (ICML)
