NusaCrowd: Open Source Initiative for Indonesian NLP Resources
Authors:
Samuel Cahyawijaya,
Holy Lovenia,
Alham Fikri Aji,
Genta Indra Winata,
Bryan Wilie,
Rahmad Mahendra,
Christian Wibisono,
Ade Romadhony,
Karissa Vincentio,
Fajri Koto,
Jennifer Santoso,
David Moeljadi,
Cahya Wirawan,
Frederikus Hudi,
Ivan Halim Parmonangan,
Ika Alfina,
Muhammad Satrio Wicaksono,
Ilham Firdausi Putra,
Samsul Rahmadani,
Yulianti Oenang,
Ali Akbar Septiandri,
James Jaya,
Kaustubh D. Dhole,
Arie Ardiyanti Suryani,
Rifki Afina Putri, et al. (22 additional authors not shown)
Abstract:
We present NusaCrowd, a collaborative initiative to collect and unify existing resources for Indonesian languages, including opening access to previously non-public resources. Through this initiative, we have brought together 137 datasets and 118 standardized data loaders. The quality of the datasets has been assessed manually and automatically, and their value is demonstrated through multiple experiments. NusaCrowd's data collection enables the creation of the first zero-shot benchmarks for natural language understanding and generation in Indonesian and the local languages of Indonesia. Furthermore, NusaCrowd enables the creation of the first multilingual automatic speech recognition benchmark in Indonesian and the local languages of Indonesia. Our work strives to advance natural language processing (NLP) research for languages that are under-represented despite being widely spoken.
Submitted 21 July, 2023; v1 submitted 19 December, 2022;
originally announced December 2022.
ParaCotta: Synthetic Multilingual Paraphrase Corpora from the Most Diverse Translation Sample Pair
Authors:
Alham Fikri Aji,
Tirana Noor Fatyanosa,
Radityo Eko Prasojo,
Philip Arthur,
Suci Fitriany,
Salma Qonitah,
Nadhifa Zulfa,
Tomi Santoso,
Mahendra Data
Abstract:
We release our synthetic parallel paraphrase corpus across 17 languages: Arabic, Catalan, Czech, German, English, Spanish, Estonian, French, Hindi, Indonesian, Italian, Dutch, Romanian, Russian, Swedish, Vietnamese, and Chinese. Our method relies only on monolingual data and a neural machine translation system to generate paraphrases, which makes it simple to apply. We generate multiple translation samples using beam search and choose the most lexically diverse pair according to their sentence BLEU. We compare our generated corpus with ParaBank2. According to our evaluation, our synthetic paraphrase pairs are semantically similar and lexically diverse.
Submitted 9 May, 2022;
originally announced May 2022.
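The pair-selection step mentioned in the ParaCotta abstract (pick the most lexically diverse pair among several translation samples, judged by sentence BLEU) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration rather than the authors' implementation: the candidate strings stand in for beam-search outputs of a translation system, `sentence_bleu` is a simplified add-one-smoothed BLEU on whitespace tokens, and averaging the score in both directions is one possible way to compare a pair.

```python
import math
from collections import Counter
from itertools import combinations


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(hypothesis, reference, max_n=4):
    """Simplified sentence BLEU with add-one smoothing (illustrative only)."""
    hyp, ref = hypothesis.split(), reference.split()
    if not hyp or not ref:
        return 0.0
    log_precision = 0.0
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
        # Clipped n-gram matches: Counter intersection keeps the minimum count.
        overlap = sum((hyp_counts & ref_counts).values())
        total = sum(hyp_counts.values())
        log_precision += math.log((overlap + 1) / (total + 1)) / max_n
    # Brevity penalty for hypotheses shorter than the reference.
    brevity = min(1.0, math.exp(1 - len(ref) / len(hyp)))
    return brevity * math.exp(log_precision)


def most_diverse_pair(candidates):
    """Return the candidate pair with the lowest symmetric sentence BLEU."""
    def symmetric_bleu(pair):
        a, b = pair
        return (sentence_bleu(a, b) + sentence_bleu(b, a)) / 2

    return min(combinations(candidates, 2), key=symmetric_bleu)


# Hypothetical translation samples for one source sentence.
samples = [
    "the cat sat on the mat",
    "the cat sat on the rug",
    "a feline rested upon the carpet",
]
print(most_diverse_pair(samples))
```

A lower BLEU between two samples means less n-gram overlap, so the selected pair is the one that shares the least surface wording. This sketch does not check that the pair is semantically similar.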