Showing 1–1 of 1 results for author: Eusse, J

  1. arXiv:2308.15710

    cs.AI cs.LG

    Speech Wikimedia: A 77 Language Multilingual Speech Dataset

    Authors: Rafael Mosquera Gómez, Julián Eusse, Juan Ciro, Daniel Galvez, Ryan Hileman, Kurt Bollacker, David Kanter

    Abstract: The Speech Wikimedia Dataset is a publicly available compilation of audio with transcriptions extracted from Wikimedia Commons. It includes 1780 hours (195 GB) of CC-BY-SA licensed transcribed speech from a diverse set of scenarios and speakers, in 77 different languages. Each audio file has one or more transcriptions in different languages, making this dataset suitable for training speech recogni…

    Submitted 29 August, 2023; originally announced August 2023.

    Comments: Data-Centric Machine Learning Workshop at the International Conference on Machine Learning (ICML) 2023