Skip to main content

Showing 1–30 of 30 results for author: Dossou, B

Searching in archive cs. Search in all archives.
.
  1. arXiv:2401.15721  [pdf, other

    cs.CV cs.AI cs.HC cs.LG

    A Study of Acquisition Functions for Medical Imaging Deep Active Learning

    Authors: Bonaventure F. P. Dossou

    Abstract: The Deep Learning revolution has enabled groundbreaking achievements in recent years. From breast cancer detection to protein folding, deep learning algorithms have been at the core of very important advancements. However, these modern advancements are becoming more and more data-hungry, especially on labeled data whose availability is scarce: this is even more prevalent in the medical context. In… ▽ More

    Submitted 29 February, 2024; v1 submitted 28 January, 2024; originally announced January 2024.

    Comments: Best Poster Award at Deep Learning Indaba 2023 Conference

  2. arXiv:2310.00274  [pdf, other

    cs.CL

    AfriSpeech-200: Pan-African Accented Speech Dataset for Clinical and General Domain ASR

    Authors: Tobi Olatunji, Tejumade Afonja, Aditya Yadavalli, Chris Chinenye Emezue, Sahib Singh, Bonaventure F. P. Dossou, Joanne Osuchukwu, Salomey Osei, Atnafu Lambebo Tonja, Naome Etori, Clinton Mbataku

    Abstract: Africa has a very low doctor-to-patient ratio. At very busy clinics, doctors could see 30+ patients per day -- a heavy patient burden compared with developed countries -- but productivity tools such as clinical automatic speech recognition (ASR) are lacking for these overworked clinicians. However, clinical ASR is mature, even ubiquitous, in developed nations, and clinician-reported performance of… ▽ More

    Submitted 30 September, 2023; originally announced October 2023.

    Comments: Accepted to TACL 2023. This is a pre-MIT Press publication version

  3. arXiv:2308.14280  [pdf, other

    cs.CL cs.AI

    FonMTL: Towards Multitask Learning for the Fon Language

    Authors: Bonaventure F. P. Dossou, Iffanice Houndayi, Pamely Zantou, Gilles Hacheme

    Abstract: The Fon language, spoken by an average 2 million of people, is a truly low-resourced African language, with a limited online presence, and existing datasets (just to name but a few). Multitask learning is a learning paradigm that aims to improve the generalization capacity of a model by sharing knowledge across different but related tasks: this could be prevalent in very data-scarce scenarios. In… ▽ More

    Submitted 11 September, 2023; v1 submitted 27 August, 2023; originally announced August 2023.

    Comments: Accepted at WiNLP workshop, co-located at EMNLP 2023

  4. arXiv:2306.02105  [pdf, other

    cs.CL cs.SD eess.AS

    Advancing African-Accented Speech Recognition: Epistemic Uncertainty-Driven Data Selection for Generalizable ASR Models

    Authors: Bonaventure F. P. Dossou

    Abstract: Accents play a pivotal role in sha** human communication, enhancing our ability to convey and comprehend messages with clarity and cultural nuance. While there has been significant progress in Automatic Speech Recognition (ASR), African-accented English ASR has been understudied due to a lack of training datasets, which are often expensive to create and demand colossal human labor. Combining sev… ▽ More

    Submitted 4 June, 2024; v1 submitted 3 June, 2023; originally announced June 2023.

    Comments: Preprint Under review. Previously accepted at SIGUL-LREC 2024 Workshop

  5. arXiv:2306.00253  [pdf, other

    cs.CL cs.CY

    AfriNames: Most ASR models "butcher" African Names

    Authors: Tobi Olatunji, Tejumade Afonja, Bonaventure F. P. Dossou, Atnafu Lambebo Tonja, Chris Chinenye Emezue, Amina Mardiyyah Rufai, Sahib Singh

    Abstract: Useful conversational agents must accurately capture named entities to minimize error for downstream tasks, for example, asking a voice assistant to play a track from a certain artist, initiating navigation to a specific location, or documenting a laboratory result for a patient. However, where named entities such as ``Ukachukwu`` (Igbo), ``Lakicia`` (Swahili), or ``Ingabire`` (Rwandan) are spoken… ▽ More

    Submitted 2 June, 2023; v1 submitted 31 May, 2023; originally announced June 2023.

    Comments: Accepted at Interspeech 2023 (Main Conference)

  6. arXiv:2305.13989  [pdf, other

    cs.CL

    MasakhaPOS: Part-of-Speech Tagging for Typologically Diverse African Languages

    Authors: Cheikh M. Bamba Dione, David Adelani, Peter Nabende, Jesujoba Alabi, Thapelo Sindane, Happy Buzaaba, Shamsuddeen Hassan Muhammad, Chris Chinenye Emezue, Perez Ogayo, Anuoluwapo Aremu, Catherine Gitau, Derguene Mbaye, Jonathan Mukiibi, Blessing Sibanda, Bonaventure F. P. Dossou, Andiswa Bukula, Rooweither Mabuya, Allahsera Auguste Tapo, Edwin Munkoh-Buabeng, victoire Memdjokam Koagne, Fatoumata Ouoba Kabore, Amelia Taylor, Godson Kalipe, Tebogo Macucwa, Vukosi Marivate , et al. (19 additional authors not shown)

    Abstract: In this paper, we present MasakhaPOS, the largest part-of-speech (POS) dataset for 20 typologically diverse African languages. We discuss the challenges in annotating POS for these languages using the UD (universal dependencies) guidelines. We conducted extensive POS baseline experiments using conditional random field and several multilingual pre-trained language models. We applied various cross-l… ▽ More

    Submitted 23 May, 2023; originally announced May 2023.

    Comments: Accepted to ACL 2023 (Main conference)

  7. arXiv:2305.06897  [pdf, other

    cs.CL cs.AI cs.IR

    AfriQA: Cross-lingual Open-Retrieval Question Answering for African Languages

    Authors: Odunayo Ogundepo, Tajuddeen R. Gwadabe, Clara E. Rivera, Jonathan H. Clark, Sebastian Ruder, David Ifeoluwa Adelani, Bonaventure F. P. Dossou, Abdou Aziz DIOP, Claytone Sikasote, Gilles Hacheme, Happy Buzaaba, Ignatius Ezeani, Rooweither Mabuya, Salomey Osei, Chris Emezue, Albert Njoroge Kahira, Shamsuddeen H. Muhammad, Akintunde Oladipo, Abraham Toluwase Owodunni, Atnafu Lambebo Tonja, Iyanuoluwa Shode, Akari Asai, Tunde Oluwaseyi Ajayi, Clemencia Siro, Steven Arthur , et al. (27 additional authors not shown)

    Abstract: African languages have far less in-language content available digitally, making it challenging for question answering systems to satisfy the information needs of users. Cross-lingual open-retrieval question answering (XOR QA) systems -- those that retrieve answer content from other languages while serving people in their native language -- offer a means of filling this gap. To this end, we create… ▽ More

    Submitted 11 May, 2023; originally announced May 2023.

  8. arXiv:2304.12155  [pdf, ps, other

    cs.CL cs.LG

    The African Stopwords project: curating stopwords for African languages

    Authors: Chris Emezue, Hellina Nigatu, Cynthia Thinwa, Helper Zhou, Shamsuddeen Muhammad, Lerato Louis, Idris Abdulmumin, Samuel Oyerinde, Benjamin Ajibade, Olanrewaju Samuel, Oviawe Joshua, Emeka Onwuegbuzia, Handel Emezue, Ifeoluwatayo A. Ige, Atnafu Lambebo Tonja, Chiamaka Chukwuneke, Bonaventure F. P. Dossou, Naome A. Etori, Mbonu Chinedu Emmanuel, Oreen Yousuf, Kaosarat Aina, Davis David

    Abstract: Stopwords are fundamental in Natural Language Processing (NLP) techniques for information retrieval. One of the common tasks in preprocessing of text data is the removal of stopwords. Currently, while high-resource languages like English benefit from the availability of several stopwords, low-resource languages, such as those found in the African continent, have none that are standardized and avai… ▽ More

    Submitted 21 March, 2023; originally announced April 2023.

    Comments: Accepted at the AfricaNLP workshop at ICLR2022

  9. arXiv:2304.09972  [pdf, other

    cs.CL

    MasakhaNEWS: News Topic Classification for African languages

    Authors: David Ifeoluwa Adelani, Marek Masiak, Israel Abebe Azime, Jesujoba Alabi, Atnafu Lambebo Tonja, Christine Mwase, Odunayo Ogundepo, Bonaventure F. P. Dossou, Akintunde Oladipo, Doreen Nixdorf, Chris Chinenye Emezue, sana al-azzawi, Blessing Sibanda, Davis David, Lolwethu Ndolela, Jonathan Mukiibi, Tunde Ajayi, Tatiana Moteu, Brian Odhiambo, Abraham Owodunni, Nnaemeka Obiefuna, Muhidin Mohamed, Shamsuddeen Hassan Muhammad, Teshome Mulugeta Ababu, Saheed Abdullahi Salahudeen , et al. (40 additional authors not shown)

    Abstract: African languages are severely under-represented in NLP research due to lack of datasets covering several NLP tasks. While there are individual language specific datasets that are being expanded to different tasks, only a handful of NLP tasks (e.g. named entity recognition and machine translation) have standardized benchmark datasets covering several geographical and typologically-diverse African… ▽ More

    Submitted 20 September, 2023; v1 submitted 19 April, 2023; originally announced April 2023.

    Comments: Accepted to IJCNLP-AACL 2023 (main conference)

  10. arXiv:2303.16985  [pdf, other

    cs.CL cs.AI

    Adapting to the Low-Resource Double-Bind: Investigating Low-Compute Methods on Low-Resource African Languages

    Authors: Colin Leong, Herumb Shandilya, Bonaventure F. P. Dossou, Atnafu Lambebo Tonja, Joel Mathew, Abdul-Hakeem Omotayo, Oreen Yousuf, Zainab Akinjobi, Chris Chinenye Emezue, Shamsudeen Muhammad, Steven Kolawole, Younwoo Choi, Tosin Adewumi

    Abstract: Many natural language processing (NLP) tasks make use of massively pre-trained language models, which are computationally expensive. However, access to high computational resources added to the issue of data scarcity of African languages constitutes a real barrier to research experiments on these languages. In this work, we explore the applicability of low-compute approaches such as language adapt… ▽ More

    Submitted 29 March, 2023; originally announced March 2023.

    Comments: Accepted to AfricaNLP workshop at ICLR2023

  11. arXiv:2303.10730  [pdf, other

    eess.IV cs.CV cs.LG

    Pretrained Vision Models for Predicting High-Risk Breast Cancer Stage

    Authors: Bonaventure F. P. Dossou, Yenoukoume S. K. Gbenou, Miglanche Ghomsi Nono

    Abstract: Cancer is increasingly a global health issue. Seconding cardiovascular diseases, cancers are the second biggest cause of death in the world with millions of people succumbing to the disease every year. According to the World Health Organization (WHO) report, by the end of 2020, more than 7.8 million women have been diagnosed with breast cancer, making it the world's most prevalent cancer. In this… ▽ More

    Submitted 19 March, 2023; originally announced March 2023.

    Comments: Accepted at Machine Learning for Global Health Workshop, ICLR 2023

  12. arXiv:2211.03263  [pdf, other

    cs.CL cs.AI cs.LG

    AfroLM: A Self-Active Learning-based Multilingual Pretrained Language Model for 23 African Languages

    Authors: Bonaventure F. P. Dossou, Atnafu Lambebo Tonja, Oreen Yousuf, Salomey Osei, Abigail Oppong, Iyanuoluwa Shode, Oluwabusayo Olufunke Awoyomi, Chris Chinenye Emezue

    Abstract: In recent years, multilingual pre-trained language models have gained prominence due to their remarkable performance on numerous downstream Natural Language Processing tasks (NLP). However, pre-training these large multilingual language models requires a lot of training data, which is not available for African Languages. Active learning is a semi-supervised learning algorithm, in which a model con… ▽ More

    Submitted 23 November, 2022; v1 submitted 6 November, 2022; originally announced November 2022.

    Comments: Third Workshop on Simple and Efficient Natural Language Processing, EMNLP 2022

  13. arXiv:2210.12928  [pdf, other

    cs.LG cs.AI

    GFlowOut: Dropout with Generative Flow Networks

    Authors: Dianbo Liu, Moksh Jain, Bonaventure Dossou, Qianli Shen, Salem Lahlou, Anirudh Goyal, Nikolay Malkin, Chris Emezue, Dinghuai Zhang, Nadhir Hassen, Xu Ji, Kenji Kawaguchi, Yoshua Bengio

    Abstract: Bayesian Inference offers principled tools to tackle many critical problems with modern neural networks such as poor calibration and generalization, and data inefficiency. However, scaling Bayesian inference to large architectures is challenging and requires restrictive approximations. Monte Carlo Dropout has been widely used as a relatively cheap way for approximate Inference and to estimate unce… ▽ More

    Submitted 23 June, 2023; v1 submitted 23 October, 2022; originally announced October 2022.

  14. arXiv:2210.12391  [pdf, other

    cs.CL

    MasakhaNER 2.0: Africa-centric Transfer Learning for Named Entity Recognition

    Authors: David Ifeoluwa Adelani, Graham Neubig, Sebastian Ruder, Shruti Rijhwani, Michael Beukman, Chester Palen-Michel, Constantine Lignos, Jesujoba O. Alabi, Shamsuddeen H. Muhammad, Peter Nabende, Cheikh M. Bamba Dione, Andiswa Bukula, Rooweither Mabuya, Bonaventure F. P. Dossou, Blessing Sibanda, Happy Buzaaba, Jonathan Mukiibi, Godson Kalipe, Derguene Mbaye, Amelia Taylor, Fatoumata Kabore, Chris Chinenye Emezue, Anuoluwapo Aremu, Perez Ogayo, Catherine Gitau , et al. (20 additional authors not shown)

    Abstract: African languages are spoken by over a billion people, but are underrepresented in NLP research and development. The challenges impeding progress include the limited availability of annotated datasets, as well as a lack of understanding of the settings where current methods are effective. In this paper, we make progress towards solutions for these challenges, focusing on the task of named entity r… ▽ More

    Submitted 15 November, 2022; v1 submitted 22 October, 2022; originally announced October 2022.

    Comments: Accepted to EMNLP 2022 (updated Github link)

  15. arXiv:2209.13518  [pdf

    q-bio.BM cs.AI cs.LG

    Graph-Based Active Machine Learning Method for Diverse and Novel Antimicrobial Peptides Generation and Selection

    Authors: Bonaventure F. P. Dossou, Dianbo Liu, Xu Ji, Moksh Jain, Almer M. van der Sloot, Roger Palou, Michael Tyers, Yoshua Bengio

    Abstract: As antibiotic-resistant bacterial strains are rapidly spreading worldwide, infections caused by these strains are emerging as a global crisis causing the death of millions of people every year. Antimicrobial Peptides (AMPs) are one of the candidates to tackle this problem because of their potential diversity, and ability to favorably modulate the host immune response. However, large-scale screenin… ▽ More

    Submitted 18 September, 2022; originally announced September 2022.

    Comments: Under Review at Sciences Advances

  16. arXiv:2205.02022  [pdf, other

    cs.CL

    A Few Thousand Translations Go a Long Way! Leveraging Pre-trained Models for African News Translation

    Authors: David Ifeoluwa Adelani, Jesujoba Oluwadara Alabi, Angela Fan, Julia Kreutzer, Xiaoyu Shen, Machel Reid, Dana Ruiter, Dietrich Klakow, Peter Nabende, Ernie Chang, Tajuddeen Gwadabe, Freshia Sackey, Bonaventure F. P. Dossou, Chris Chinenye Emezue, Colin Leong, Michael Beukman, Shamsuddeen Hassan Muhammad, Guyo Dub Jarso, Oreen Yousuf, Andre Niyongabo Rubungo, Gilles Hacheme, Eric Peter Wairagala, Muhammad Umair Nasir, Benjamin Ayoade Ajibade, Tunde Oluwaseyi Ajayi , et al. (20 additional authors not shown)

    Abstract: Recent advances in the pre-training of language models leverage large-scale datasets to create multilingual models. However, low-resource languages are mostly left out in these datasets. This is primarily because many widely spoken languages are not well represented on the web and therefore excluded from the large-scale crawls used to create datasets. Furthermore, downstream users of these models… ▽ More

    Submitted 22 August, 2022; v1 submitted 4 May, 2022; originally announced May 2022.

    Comments: Accepted to NAACL 2022 (added evaluation data for amh, kin, nya, sna, xho)

  17. arXiv:2204.04306  [pdf, other

    cs.CL cs.AI cs.CY

    MMTAfrica: Multilingual Machine Translation for African Languages

    Authors: Chris C. Emezue, Bonaventure F. P. Dossou

    Abstract: In this paper, we focus on the task of multilingual machine translation for African languages and describe our contribution in the 2021 WMT Shared Task: Large-Scale Multilingual Machine Translation. We introduce MMTAfrica, the first many-to-many multilingual translation system for six African languages: Fon (fon), Igbo (ibo), Kinyarwanda (kin), Swahili/Kiswahili (swa), Xhosa (xho), and Yoruba (yor… ▽ More

    Submitted 8 April, 2022; originally announced April 2022.

    Comments: WMT Shared Task, EMNLP 2021 (version 2)

    Journal ref: Proceedings of the Sixth Conference on Machine Translation (2021) 398-411, Association for Computational Linguistics

  18. arXiv:2203.04115  [pdf, other

    q-bio.BM cs.LG

    Biological Sequence Design with GFlowNets

    Authors: Moksh Jain, Emmanuel Bengio, Alex-Hernandez Garcia, Jarrid Rector-Brooks, Bonaventure F. P. Dossou, Chanakya Ekbote, Jie Fu, Tianyu Zhang, Micheal Kilgour, Dinghuai Zhang, Lena Simine, Payel Das, Yoshua Bengio

    Abstract: Design of de novo biological sequences with desired properties, like protein and DNA sequences, often involves an active loop with several rounds of molecule ideation and expensive wet-lab evaluations. These experiments can consist of multiple stages, with increasing levels of precision and cost of evaluation, where candidates are filtered. This makes the diversity of proposed candidates a key con… ▽ More

    Submitted 24 May, 2023; v1 submitted 2 March, 2022; originally announced March 2022.

    Comments: ICML 2022. 15 pages, 3 figures. Code available at: https://github.com/MJ10/BioSeq-GFN-AL. Updated GFP results

  19. arXiv:2109.07916  [pdf, other

    eess.AS cs.CV cs.HC

    FSER: Deep Convolutional Neural Networks for Speech Emotion Recognition

    Authors: Bonaventure F. P. Dossou, Yeno K. S. Gbenou

    Abstract: Using mel-spectrograms over conventional MFCCs features, we assess the abilities of convolutional neural networks to accurately recognize and classify emotions from speech data. We introduce FSER, a speech emotion recognition model trained on four valid speech databases, achieving a high-classification accuracy of 95,05\%, over 8 different emotion classes: anger, anxiety, calm, disgust, happiness,… ▽ More

    Submitted 15 September, 2021; originally announced September 2021.

    Comments: ABAW Workshop, ICCV 2021

  20. Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets

    Authors: Julia Kreutzer, Isaac Caswell, Lisa Wang, Ahsan Wahab, Daan van Esch, Nasanbayar Ulzii-Orshikh, Allahsera Tapo, Nishant Subramani, Artem Sokolov, Claytone Sikasote, Monang Setyawan, Supheakmungkol Sarin, Sokhar Samb, Benoît Sagot, Clara Rivera, Annette Rios, Isabel Papadimitriou, Salomey Osei, Pedro Ortiz Suarez, Iroro Orife, Kelechi Ogueji, Andre Niyongabo Rubungo, Toan Q. Nguyen, Mathias Müller, André Müller , et al. (27 additional authors not shown)

    Abstract: With the success of large-scale pre-training and multilingual modeling in Natural Language Processing (NLP), recent years have seen a proliferation of large, web-mined text datasets covering hundreds of languages. We manually audit the quality of 205 language-specific corpora released with five major public datasets (CCAligned, ParaCrawl, WikiMatrix, OSCAR, mC4). Lower-resource corpora have system… ▽ More

    Submitted 21 February, 2022; v1 submitted 22 March, 2021; originally announced March 2021.

    Comments: Accepted at TACL; pre-MIT Press publication version

    Journal ref: Transactions of the Association for Computational Linguistics (2022) 10: 50-72

  21. arXiv:2103.11811  [pdf

    cs.CL cs.AI

    MasakhaNER: Named Entity Recognition for African Languages

    Authors: David Ifeoluwa Adelani, Jade Abbott, Graham Neubig, Daniel D'souza, Julia Kreutzer, Constantine Lignos, Chester Palen-Michel, Happy Buzaaba, Shruti Rijhwani, Sebastian Ruder, Stephen Mayhew, Israel Abebe Azime, Shamsuddeen Muhammad, Chris Chinenye Emezue, Joyce Nakatumba-Nabende, Perez Ogayo, Anuoluwapo Aremu, Catherine Gitau, Derguene Mbaye, Jesujoba Alabi, Seid Muhie Yimam, Tajuddeen Gwadabe, Ignatius Ezeani, Rubungo Andre Niyongabo, Jonathan Mukiibi , et al. (36 additional authors not shown)

    Abstract: We take a step towards addressing the under-representation of the African continent in NLP research by creating the first large publicly available high-quality dataset for named entity recognition (NER) in ten African languages, bringing together a variety of stakeholders. We detail characteristics of the languages to help researchers understand the challenges that these languages pose for NER. We… ▽ More

    Submitted 5 July, 2021; v1 submitted 22 March, 2021; originally announced March 2021.

    Comments: Accepted to TACL 2021, pre-MIT Press publication version

  22. arXiv:2103.08052  [pdf, other

    cs.CL cs.AI

    Crowdsourced Phrase-Based Tokenization for Low-Resourced Neural Machine Translation: The Case of Fon Language

    Authors: Bonaventure F. P. Dossou, Chris C. Emezue

    Abstract: Building effective neural machine translation (NMT) models for very low-resourced and morphologically rich African indigenous languages is an open challenge. Besides the issue of finding available resources for them, a lot of work is put into preprocessing and tokenization. Recent studies have shown that standard tokenization methods do not always adequately deal with the grammatical, diacritical,… ▽ More

    Submitted 17 March, 2021; v1 submitted 14 March, 2021; originally announced March 2021.

    Journal ref: African NLP, EACL 2021

  23. arXiv:2103.07762  [pdf, ps, other

    cs.CL cs.AI cs.CY

    OkwuGbé: End-to-End Speech Recognition for Fon and Igbo

    Authors: Bonaventure F. P. Dossou, Chris C. Emezue

    Abstract: Language is inherent and compulsory for human communication. Whether expressed in a written or spoken way, it ensures understanding between people of the same and different regions. With the growing awareness and effort to include more low-resourced languages in NLP research, African languages have recently been a major subject of research in machine translation, and other text-based areas of NLP.… ▽ More

    Submitted 16 March, 2021; v1 submitted 13 March, 2021; originally announced March 2021.

    Journal ref: African NLP, EACL 2021

  24. arXiv:2103.05132  [pdf, other

    cs.CL cs.AI

    AfriVEC: Word Embedding Models for African Languages. Case Study of Fon and Nobiin

    Authors: Bonaventure F. P. Dossou, Mohammed Sabry

    Abstract: From Word2Vec to GloVe, word embedding models have played key roles in the current state-of-the-art results achieved in Natural Language Processing. Designed to give significant and unique vectorized representations of words and entities, those models have proven to efficiently extract similarities and establish relationships reflecting semantic and contextual meaning among words and entities. Afr… ▽ More

    Submitted 18 March, 2021; v1 submitted 8 March, 2021; originally announced March 2021.

    Journal ref: Africa NLP, EACL 2021

  25. arXiv:2012.03487  [pdf, other

    cs.AI cs.CV cs.CY

    An Approach to Intelligent Pneumonia Detection and Integration

    Authors: Bonaventure F. P. Dossou, Alena Iureva, Sayali R. Rajhans, Vamsi S. Pidikiti

    Abstract: Each year, over 2.5 million people, most of them in developed countries, die from pneumonia [1]. Since many studies have proved pneumonia is successfully treatable when timely and correctly diagnosed, many of diagnosis aids have been developed, with AI-based methods achieving high accuracies [2]. However, currently, the usage of AI in pneumonia detection is limited, in particular, due to challenge… ▽ More

    Submitted 7 December, 2020; originally announced December 2020.

  26. arXiv:2010.02353  [pdf, other

    cs.CL cs.AI cs.LG

    Participatory Research for Low-resourced Machine Translation: A Case Study in African Languages

    Authors: Wilhelmina Nekoto, Vukosi Marivate, Tshinondiwa Matsila, Timi Fasubaa, Tajudeen Kolawole, Taiwo Fagbohungbe, Solomon Oluwole Akinola, Shamsuddeen Hassan Muhammad, Salomon Kabongo, Salomey Osei, Sackey Freshia, Rubungo Andre Niyongabo, Ricky Macharm, Perez Ogayo, Orevaoghene Ahia, Musie Meressa, Mofe Adeyemi, Masabata Mokgesi-Selinga, Lawrence Okegbemi, Laura Jane Martinus, Kolawole Tajudeen, Kevin Degila, Kelechi Ogueji, Kathleen Siminyu, Julia Kreutzer , et al. (23 additional authors not shown)

    Abstract: Research in NLP lacks geographic diversity, and the question of how NLP can be scaled to low-resourced languages has not yet been adequately solved. "Low-resourced"-ness is a complex problem going beyond data availability and reflects systemic problems in society. In this paper, we focus on the task of Machine Translation (MT), that plays a crucial role for information accessibility and communicat… ▽ More

    Submitted 6 November, 2020; v1 submitted 5 October, 2020; originally announced October 2020.

    Comments: Findings of EMNLP 2020; updated benchmarks

  27. arXiv:2008.07302  [pdf, other

    cs.CY cs.CL

    Lanfrica: A Participatory Approach to Documenting Machine Translation Research on African Languages

    Authors: Chris C. Emezue, Bonaventure F. P. Dossou

    Abstract: Over the years, there have been campaigns to include the African languages in the growing research on machine translation (MT) in particular, and natural language processing (NLP) in general. Africa has the highest language diversity, with 1500-2000 documented languages and many more undocumented or extinct languages(Lewis, 2009; Bendor-Samuel, 2017). This makes it hard to keep track of the MT res… ▽ More

    Submitted 3 August, 2020; originally announced August 2020.

  28. arXiv:2006.09217  [pdf, ps, other

    cs.CL cs.LG

    FFR v1.1: Fon-French Neural Machine Translation

    Authors: Bonaventure F. P. Dossou, Chris C. Emezue

    Abstract: All over the world and especially in Africa, researchers are putting efforts into building Neural Machine Translation (NMT) systems to help tackle the language barriers in Africa, a continent of over 2000 different languages. However, the low-resourceness, diacritical, and tonal complexities of African languages are major issues being faced. The FFR project is a major step towards creating a robus… ▽ More

    Submitted 14 June, 2020; originally announced June 2020.

    Comments: Accepted for publication at the Widening Natural Language Processing (WiNLP) Workshop, The 58th Annual Meeting of the Association for Computational Linguistics, 2020

  29. arXiv:2003.12111  [pdf, ps, other

    cs.CL

    FFR V1.0: Fon-French Neural Machine Translation

    Authors: Bonaventure F. P. Dossou, Chris C. Emezue

    Abstract: Africa has the highest linguistic diversity in the world. On account of the importance of language to communication, and the importance of reliable, powerful and accurate machine translation models in modern inter-cultural communication, there have been (and still are) efforts to create state-of-the-art translation models for the many African languages. However, the low-resources, diacritical and… ▽ More

    Submitted 26 March, 2020; originally announced March 2020.

    Comments: Accepted for the AfricaNLP Workshop, ICLR 2020

  30. arXiv:2003.11529  [pdf, other

    cs.CL

    Masakhane -- Machine Translation For Africa

    Authors: Iroro Orife, Julia Kreutzer, Blessing Sibanda, Daniel Whitenack, Kathleen Siminyu, Laura Martinus, Jamiil Toure Ali, Jade Abbott, Vukosi Marivate, Salomon Kabongo, Musie Meressa, Espoir Murhabazi, Orevaoghene Ahia, Elan van Biljon, Arshath Ramkilowan, Adewale Akinfaderin, Alp Öktem, Wole Akin, Ghollah Kioko, Kevin Degila, Herman Kamper, Bonaventure Dossou, Chris Emezue, Kelechi Ogueji, Abdallah Bashir

    Abstract: Africa has over 2000 languages. Despite this, African languages account for a small portion of available resources and publications in Natural Language Processing (NLP). This is due to multiple factors, including: a lack of focus from government and funding, discoverability, a lack of community, sheer language complexity, difficulty in reproducing papers and no benchmarks to compare techniques. To… ▽ More

    Submitted 13 March, 2020; originally announced March 2020.

    Comments: Accepted for the AfricaNLP Workshop, ICLR 2020