
Showing 1–2 of 2 results for author: Shadieq, N

  1. arXiv:2404.06138  [pdf, other]

    cs.CL

    Cendol: Open Instruction-tuned Generative Large Language Models for Indonesian Languages

    Authors: Samuel Cahyawijaya, Holy Lovenia, Fajri Koto, Rifki Afina Putri, Emmanuel Dave, Jhonson Lee, Nuur Shadieq, Wawan Cenggoro, Salsabil Maulana Akbar, Muhammad Ihza Mahendra, Dea Annisayanti Putri, Bryan Wilie, Genta Indra Winata, Alham Fikri Aji, Ayu Purwarianti, Pascale Fung

    Abstract: Large language models (LLMs) show remarkable human-like capability in various domains and languages. However, a notable quality gap arises in low-resource languages, e.g., Indonesian indigenous languages, rendering them ineffective and inefficient in such linguistic contexts. To bridge this quality gap, we introduce Cendol, a collection of Indonesian LLMs encompassing both decoder-only and encoder…

    Submitted 9 April, 2024; originally announced April 2024.

    Comments: Cendol models are released under Apache 2.0 license and will be made publicly available soon

  2. arXiv:2309.10661  [pdf, other]

    cs.CL cs.AI

    NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages

    Authors: Samuel Cahyawijaya, Holy Lovenia, Fajri Koto, Dea Adhista, Emmanuel Dave, Sarah Oktavianti, Salsabil Maulana Akbar, Jhonson Lee, Nuur Shadieq, Tjeng Wawan Cenggoro, Hanung Wahyuning Linuwih, Bryan Wilie, Galih Pradipta Muridan, Genta Indra Winata, David Moeljadi, Alham Fikri Aji, Ayu Purwarianti, Pascale Fung

    Abstract: Democratizing access to natural language processing (NLP) technology is crucial, especially for underrepresented and extremely low-resource languages. Previous research has focused on developing labeled and unlabeled corpora for these languages through online scraping and document translation. While these methods have proven effective and cost-efficient, we have identified limitations in the resul…

    Submitted 19 September, 2023; v1 submitted 19 September, 2023; originally announced September 2023.