Skip to main content

Showing 1–5 of 5 results for author: Diandaru, R

.
  1. arXiv:2406.10118  [pdf, other

    cs.CL

    SEACrowd: A Multilingual Multimodal Data Hub and Benchmark Suite for Southeast Asian Languages

    Authors: Holy Lovenia, Rahmad Mahendra, Salsabil Maulana Akbar, Lester James V. Miranda, Jennifer Santoso, Elyanah Aco, Akhdan Fadhilah, Jonibek Mansurov, Joseph Marvin Imperial, Onno P. Kampman, Joel Ruben Antony Moniz, Muhammad Ravi Shulthan Habibi, Frederikus Hudi, Railey Montalan, Ryan Ignatius, Joanito Agili Lopo, William Nixon, Börje F. Karlsson, James Jaya, Ryandito Diandaru, Yuze Gao, Patrick Amadeus, Bin Wang, Jan Christian Blaise Cruz, Chenxi Whitehouse , et al. (36 additional authors not shown)

    Abstract: Southeast Asia (SEA) is a region rich in linguistic diversity and cultural variety, with over 1,300 indigenous languages and a population of 671 million people. However, prevailing AI models suffer from a significant lack of representation of texts, images, and audio datasets from SEA, compromising the quality of AI models for SEA languages. Evaluating models for SEA languages is challenging due t… ▽ More

    Submitted 14 June, 2024; originally announced June 2024.

    Comments: https://github.com/SEACrowd

  2. arXiv:2406.05967  [pdf, other

    cs.CV cs.AI cs.CL cs.LG

    CVQA: Culturally-diverse Multilingual Visual Question Answering Benchmark

    Authors: David Romero, Chenyang Lyu, Haryo Akbarianto Wibowo, Teresa Lynn, Injy Hamed, Aditya Nanda Kishore, Aishik Mandal, Alina Dragonetti, Artem Abzaliev, Atnafu Lambebo Tonja, Bontu Fufa Balcha, Chenxi Whitehouse, Christian Salamea, Dan John Velasco, David Ifeoluwa Adelani, David Le Meur, Emilio Villa-Cueva, Fajri Koto, Fauzan Farooqui, Frederico Belcavello, Ganzorig Batnasan, Gisela Vallejo, Grainne Caulfield, Guido Ivetta, Haiyue Song , et al. (50 additional authors not shown)

    Abstract: Visual Question Answering (VQA) is an important task in multimodal AI, and it is often used to test the ability of vision-language models to understand and reason on knowledge present in both visual and textual data. However, most of the current VQA models use datasets that are primarily focused on English and a few major world languages, with images that are typically Western-centric. While recen… ▽ More

    Submitted 9 June, 2024; originally announced June 2024.

  3. arXiv:2402.13917  [pdf, other

    cs.CL cs.AI

    Could We Have Had Better Multilingual LLMs If English Was Not the Central Language?

    Authors: Ryandito Diandaru, Lucky Susanto, Zilu Tang, Ayu Purwarianti, Derry Wijaya

    Abstract: Large Language Models (LLMs) demonstrate strong machine translation capabilities on languages they are trained on. However, the impact of factors beyond training data size on translation performance remains a topic of debate, especially concerning languages not directly encountered during training. Our study delves into Llama2's translation capabilities. By modeling a linear relationship between l… ▽ More

    Submitted 5 April, 2024; v1 submitted 21 February, 2024; originally announced February 2024.

    Comments: TDLE 2024

  4. arXiv:2311.00998  [pdf, other

    cs.CL cs.AI

    Replicable Benchmarking of Neural Machine Translation (NMT) on Low-Resource Local Languages in Indonesia

    Authors: Lucky Susanto, Ryandito Diandaru, Adila Krisnadhi, Ayu Purwarianti, Derry Wijaya

    Abstract: Neural machine translation (NMT) for low-resource local languages in Indonesia faces significant challenges, including the need for a representative benchmark and limited data availability. This work addresses these challenges by comprehensively analyzing training NMT systems for four low-resource local languages in Indonesia: Javanese, Sundanese, Minangkabau, and Balinese. Our study encompasses v… ▽ More

    Submitted 2 November, 2023; originally announced November 2023.

    Comments: Accepted on SEALP 2023, Workshop in IJCNLP-AACL 2023

  5. arXiv:2212.09648  [pdf, other

    cs.CL cs.AI

    NusaCrowd: Open Source Initiative for Indonesian NLP Resources

    Authors: Samuel Cahyawijaya, Holy Lovenia, Alham Fikri Aji, Genta Indra Winata, Bryan Wilie, Rahmad Mahendra, Christian Wibisono, Ade Romadhony, Karissa Vincentio, Fajri Koto, Jennifer Santoso, David Moeljadi, Cahya Wirawan, Frederikus Hudi, Ivan Halim Parmonangan, Ika Alfina, Muhammad Satrio Wicaksono, Ilham Firdausi Putra, Samsul Rahmadani, Yulianti Oenang, Ali Akbar Septiandri, James Jaya, Kaustubh D. Dhole, Arie Ardiyanti Suryani, Rifki Afina Putri , et al. (22 additional authors not shown)

    Abstract: We present NusaCrowd, a collaborative initiative to collect and unify existing resources for Indonesian languages, including opening access to previously non-public resources. Through this initiative, we have brought together 137 datasets and 118 standardized data loaders. The quality of the datasets has been assessed manually and automatically, and their value is demonstrated through multiple exp… ▽ More

    Submitted 21 July, 2023; v1 submitted 19 December, 2022; originally announced December 2022.