-
AI4D -- African Language Program
Authors:
Kathleen Siminyu,
Godson Kalipe,
Davor Orlic,
Jade Abbott,
Vukosi Marivate,
Sackey Freshia,
Prateek Sibal,
Bhanu Neupane,
David I. Adelani,
Amelia Taylor,
Jamiil Toure ALI,
Kevin Degila,
Momboladji Balogoun,
Thierno Ibrahima DIOP,
Davis David,
Chayma Fourati,
Hatem Haddad,
Malek Naski
Abstract:
Advances in speech and language technologies enable tools such as voice search, text-to-speech, speech recognition and machine translation. These are, however, only available for high-resource languages like English, French or Chinese. Without foundational digital resources for African languages, which are considered low-resource in the digital context, these advanced tools remain out of reach. This work details the AI4D - African Language Program, a 3-part project that 1) incentivised the crowd-sourcing, collection and curation of language datasets through an online quantitative and qualitative challenge, 2) supported research fellows for a period of 3-4 months to create datasets annotated for NLP tasks, and 3) hosted competitive Machine Learning challenges on the basis of these datasets. Key outcomes of the work so far include 1) the creation of 9+ open-source African language datasets annotated for a variety of ML tasks, and 2) the creation of baseline models for these datasets through the hosting of competitive ML challenges.
Submitted 6 April, 2021;
originally announced April 2021.
-
Participatory Research for Low-resourced Machine Translation: A Case Study in African Languages
Authors:
Wilhelmina Nekoto,
Vukosi Marivate,
Tshinondiwa Matsila,
Timi Fasubaa,
Tajudeen Kolawole,
Taiwo Fagbohungbe,
Solomon Oluwole Akinola,
Shamsuddeen Hassan Muhammad,
Salomon Kabongo,
Salomey Osei,
Sackey Freshia,
Rubungo Andre Niyongabo,
Ricky Macharm,
Perez Ogayo,
Orevaoghene Ahia,
Musie Meressa,
Mofe Adeyemi,
Masabata Mokgesi-Selinga,
Lawrence Okegbemi,
Laura Jane Martinus,
Kolawole Tajudeen,
Kevin Degila,
Kelechi Ogueji,
Kathleen Siminyu,
Julia Kreutzer,
et al. (23 additional authors not shown)
Abstract:
Research in NLP lacks geographic diversity, and the question of how NLP can be scaled to low-resourced languages has not yet been adequately solved. "Low-resourced"-ness is a complex problem going beyond data availability and reflects systemic problems in society. In this paper, we focus on the task of Machine Translation (MT), which plays a crucial role for information accessibility and communication worldwide. Despite immense improvements in MT over the past decade, MT is centered around a few high-resourced languages. As MT researchers cannot solve the problem of low-resourcedness alone, we propose participatory research as a means to involve all necessary agents required in the MT development process. We demonstrate the feasibility and scalability of participatory research with a case study on MT for African languages. Its implementation leads to a collection of novel translation datasets and MT benchmarks for over 30 languages, with human evaluations for a third of them, and enables participants without formal training to make a unique scientific contribution. Benchmarks, models, data, code, and evaluation results are released under https://github.com/masakhane-io/masakhane-mt.
Submitted 6 November, 2020; v1 submitted 5 October, 2020;
originally announced October 2020.
-
Masakhane -- Machine Translation For Africa
Authors:
Iroro Orife,
Julia Kreutzer,
Blessing Sibanda,
Daniel Whitenack,
Kathleen Siminyu,
Laura Martinus,
Jamiil Toure Ali,
Jade Abbott,
Vukosi Marivate,
Salomon Kabongo,
Musie Meressa,
Espoir Murhabazi,
Orevaoghene Ahia,
Elan van Biljon,
Arshath Ramkilowan,
Adewale Akinfaderin,
Alp Öktem,
Wole Akin,
Ghollah Kioko,
Kevin Degila,
Herman Kamper,
Bonaventure Dossou,
Chris Emezue,
Kelechi Ogueji,
Abdallah Bashir
Abstract:
Africa has over 2000 languages. Despite this, African languages account for a small portion of available resources and publications in Natural Language Processing (NLP). This is due to multiple factors, including: a lack of focus from government and funding, discoverability, a lack of community, sheer language complexity, difficulty in reproducing papers and no benchmarks to compare techniques. To begin to address the identified problems, MASAKHANE, an open-source, continent-wide, distributed, online research effort for machine translation for African languages, was founded. In this paper, we discuss our methodology for building the community and spurring research from the African continent, as well as outline the success of the community in terms of addressing the identified problems affecting African NLP.
Submitted 13 March, 2020;
originally announced March 2020.