Showing 1–2 of 2 results for author: Lastrucci, R

  1. arXiv:2310.09141  [pdf, ps, other]

    cs.CL

    PuoBERTa: Training and evaluation of a curated language model for Setswana

    Authors: Vukosi Marivate, Moseli Mots'Oehli, Valencia Wagner, Richard Lastrucci, Isheanesu Dzingirai

    Abstract: Natural language processing (NLP) has made significant progress for well-resourced languages such as English but lagged behind for low-resource languages like Setswana. This paper addresses this gap by presenting PuoBERTa, a customised masked language model trained specifically for Setswana. We cover how we collected, curated, and prepared diverse monolingual texts to generate a high-quality corpu…

    Submitted 24 October, 2023; v1 submitted 13 October, 2023; originally announced October 2023.

    Comments: Accepted for SACAIR 2023

  2. arXiv:2303.03750  [pdf, other]

    cs.CL

    Preparing the Vuk'uzenzele and ZA-gov-multilingual South African multilingual corpora

    Authors: Richard Lastrucci, Isheanesu Dzingirai, Jenalea Rajab, Andani Madodonga, Matimba Shingange, Daniel N**i, Vukosi Marivate

    Abstract: This paper introduces two multilingual government themed corpora in various South African languages. The corpora were collected by gathering the South African Government newspaper (Vuk'uzenzele), as well as South African government speeches (ZA-gov-multilingual), that are translated into all 11 South African official languages. The corpora can be used for a myriad of downstream NLP tasks. The corp…

    Submitted 5 April, 2023; v1 submitted 7 March, 2023; originally announced March 2023.

    Comments: Accepted and to appear at Fourth workshop on Resources for African Indigenous Languages (RAIL) at EACL 2023