SEACrowd: A Multilingual Multimodal Data Hub and Benchmark Suite for Southeast Asian Languages
Authors:
Holy Lovenia,
Rahmad Mahendra,
Salsabil Maulana Akbar,
Lester James V. Miranda,
Jennifer Santoso,
Elyanah Aco,
Akhdan Fadhilah,
Jonibek Mansurov,
Joseph Marvin Imperial,
Onno P. Kampman,
Joel Ruben Antony Moniz,
Muhammad Ravi Shulthan Habibi,
Frederikus Hudi,
Railey Montalan,
Ryan Ignatius,
Joanito Agili Lopo,
William Nixon,
Börje F. Karlsson,
James Jaya,
Ryandito Diandaru,
Yuze Gao,
Patrick Amadeus,
Bin Wang,
Jan Christian Blaise Cruz,
Chenxi Whitehouse,
et al. (36 additional authors not shown)
Abstract:
Southeast Asia (SEA) is a region rich in linguistic diversity and cultural variety, with over 1,300 indigenous languages and a population of 671 million people. However, prevailing AI models suffer from a significant lack of representation of texts, images, and audio datasets from SEA, compromising the quality of AI models for SEA languages. Evaluating models for SEA languages is challenging due to the scarcity of high-quality datasets, compounded by the dominance of English training data, raising concerns about potential cultural misrepresentation. To address these challenges, we introduce SEACrowd, a collaborative initiative that consolidates a comprehensive resource hub that fills the resource gap by providing standardized corpora in nearly 1,000 SEA languages across three modalities. Through our SEACrowd benchmarks, we assess the quality of AI models on 36 indigenous languages across 13 tasks, offering valuable insights into the current AI landscape in SEA. Furthermore, we propose strategies to facilitate greater AI advancements, maximizing potential utility and resource equity for the future of AI in SEA.
Submitted 14 June, 2024;
originally announced June 2024.
Constructing and Expanding Low-Resource and Underrepresented Parallel Datasets for Indonesian Local Languages
Authors:
Joanito Agili Lopo,
Radius Tanone
Abstract:
In Indonesia, local languages play an integral role in the culture. However, the available resources for these languages still fall into the category of limited data in the Natural Language Processing (NLP) field. This becomes problematic when building NLP models for these languages. To address this gap, we introduce Bhinneka Korpus, a multilingual parallel corpus featuring five Indonesian local languages. Our goal is to enhance access to and utilization of these resources, extending their reach within the country. We explain in detail the dataset collection process and the associated challenges. Additionally, we experiment with a translation task using IBM Model 1, chosen because of data constraints. The results for each language show good indications for further development. Challenges such as lexical variation, smoothing effects, and cross-linguistic variability are discussed. We intend to evaluate the corpus using advanced NLP techniques for low-resource languages, paving the way for multilingual translation models.
Submitted 1 April, 2024;
originally announced April 2024.