-
Text Categorization Can Enhance Domain-Agnostic Stopword Extraction
Authors:
Houcemeddine Turki,
Naome A. Etori,
Mohamed Ali Hadj Taieb,
Abdul-Hakeem Omotayo,
Chris Chinenye Emezue,
Mohamed Ben Aouicha,
Ayodele Awokoya,
Falalu Ibrahim Lawan,
Doreen Nixdorf
Abstract:
This paper investigates the role of text categorization in streamlining stopword extraction in natural language processing (NLP), specifically focusing on nine African languages alongside French. By leveraging the MasakhaNEWS, African Stopwords Project, and MasakhaPOS datasets, our findings emphasize that text categorization effectively identifies domain-agnostic stopwords with over 80% detection…
▽ More
This paper investigates the role of text categorization in streamlining stopword extraction in natural language processing (NLP), specifically focusing on nine African languages alongside French. By leveraging the MasakhaNEWS, African Stopwords Project, and MasakhaPOS datasets, our findings emphasize that text categorization effectively identifies domain-agnostic stopwords with over 80% detection success rate for most examined languages. Nevertheless, linguistic variances result in lower detection rates for certain languages. Interestingly, we find that while over 40% of stopwords are common across news categories, less than 15% are unique to a single category. Uncommon stopwords add depth to text but their classification as stopwords depends on context. Therefore combining statistical and linguistic approaches creates comprehensive stopword lists, highlighting the value of our hybrid method. This research enhances NLP for African languages and underscores the importance of text categorization in stopword extraction.
△ Less
Submitted 24 January, 2024;
originally announced January 2024.
-
A Survey on African Computer Vision Datasets, Topics and Researchers
Authors:
Abdul-Hakeem Omotayo,
Ashery Mbilinyi,
Lukman Ismaila,
Houcemeddine Turki,
Mahmoud Abdien,
Karim Gamal,
Idriss Tondji,
Yvan Pimi,
Naome A. Etori,
Marwa M. Matar,
Clifford Broni-Bediako,
Abigail Oppong,
Mai Gamal,
Eman Ehab,
Gbetondji Dovonon,
Zainab Akinjobi,
Daniel Ajisafe,
Oluwabukola G. Adegboro,
Mennatullah Siam
Abstract:
Computer vision encompasses a range of tasks such as object detection, semantic segmentation, and 3D reconstruction. Despite its relevance to African communities, research in this field within Africa represents only 0.06% of top-tier publications over the past decade. This study undertakes a thorough analysis of 63,000 Scopus-indexed computer vision publications from Africa, spanning from 2012 to…
▽ More
Computer vision encompasses a range of tasks such as object detection, semantic segmentation, and 3D reconstruction. Despite its relevance to African communities, research in this field within Africa represents only 0.06% of top-tier publications over the past decade. This study undertakes a thorough analysis of 63,000 Scopus-indexed computer vision publications from Africa, spanning from 2012 to 2022. The aim is to provide a survey of African computer vision topics, datasets and researchers. A key aspect of our study is the identification and categorization of African Computer Vision datasets using large language models that automatically parse abstracts of these publications. We also provide a compilation of unofficial African Computer Vision datasets distributed through challenges or data hosting platforms, and provide a full taxonomy of dataset categories. Our survey also pinpoints computer vision topics trends specific to different African regions, indicating their unique focus areas. Additionally, we carried out an extensive survey to capture the views of African researchers on the current state of computer vision research in the continent and the structural barriers they believe need urgent attention. In conclusion, this study catalogs and categorizes Computer Vision datasets and topics contributed or initiated by African institutions and identifies barriers to publishing in top-tier Computer Vision venues. This survey underscores the importance of encouraging African researchers and institutions in advancing computer vision research in the continent. It also stresses on the need for research topics to be more aligned with the needs of African communities.
△ Less
Submitted 4 February, 2024; v1 submitted 21 January, 2024;
originally announced January 2024.
-
AfriMTE and AfriCOMET: Enhancing COMET to Embrace Under-resourced African Languages
Authors:
Jiayi Wang,
David Ifeoluwa Adelani,
Sweta Agrawal,
Marek Masiak,
Ricardo Rei,
Eleftheria Briakou,
Marine Carpuat,
Xuanli He,
Sofia Bourhim,
Andiswa Bukula,
Muhidin Mohamed,
Temitayo Olatoye,
Tosin Adewumi,
Hamam Mokayed,
Christine Mwase,
Wangui Kimotho,
Foutse Yuehgoh,
Anuoluwapo Aremu,
Jessica Ojo,
Shamsuddeen Hassan Muhammad,
Salomey Osei,
Abdul-Hakeem Omotayo,
Chiamaka Chukwuneke,
Perez Ogayo,
Oumaima Hourrane
, et al. (33 additional authors not shown)
Abstract:
Despite the recent progress on scaling multilingual machine translation (MT) to several under-resourced African languages, accurately measuring this progress remains challenging, since evaluation is often performed on n-gram matching metrics such as BLEU, which typically show a weaker correlation with human judgments. Learned metrics such as COMET have higher correlation; however, the lack of eval…
▽ More
Despite the recent progress on scaling multilingual machine translation (MT) to several under-resourced African languages, accurately measuring this progress remains challenging, since evaluation is often performed on n-gram matching metrics such as BLEU, which typically show a weaker correlation with human judgments. Learned metrics such as COMET have higher correlation; however, the lack of evaluation data with human ratings for under-resourced languages, complexity of annotation guidelines like Multidimensional Quality Metrics (MQM), and limited language coverage of multilingual encoders have hampered their applicability to African languages. In this paper, we address these challenges by creating high-quality human evaluation data with simplified MQM guidelines for error detection and direct assessment (DA) scoring for 13 typologically diverse African languages. Furthermore, we develop AfriCOMET: COMET evaluation metrics for African languages by leveraging DA data from well-resourced languages and an African-centric multilingual encoder (AfroXLM-R) to create the state-of-the-art MT evaluation metrics for African languages with respect to Spearman-rank correlation with human judgments (0.441).
△ Less
Submitted 23 April, 2024; v1 submitted 16 November, 2023;
originally announced November 2023.
-
Towards a Better Understanding of the Computer Vision Research Community in Africa
Authors:
Abdul-Hakeem Omotayo,
Mai Gamal,
Eman Ehab,
Gbetondji Dovonon,
Zainab Akinjobi,
Ismaila Lukman,
Houcemeddine Turki,
Mahmod Abdien,
Idriss Tondji,
Abigail Oppong,
Yvan Pimi,
Karim Gamal,
Ro'ya-CV4Africa,
Mennatullah Siam
Abstract:
Computer vision is a broad field of study that encompasses different tasks (e.g., object detection). Although computer vision is relevant to the African communities in various applications, yet computer vision research is under-explored in the continent and constructs only 0.06% of top-tier publications in the last ten years. In this paper, our goal is to have a better understanding of the compute…
▽ More
Computer vision is a broad field of study that encompasses different tasks (e.g., object detection). Although computer vision is relevant to the African communities in various applications, yet computer vision research is under-explored in the continent and constructs only 0.06% of top-tier publications in the last ten years. In this paper, our goal is to have a better understanding of the computer vision research conducted in Africa and provide pointers on whether there is equity in research or not. We do this through an empirical analysis of the African computer vision publications that are Scopus indexed, where we collect around 63,000 publications over the period 2012-2022. We first study the opportunities available for African institutions to publish in top-tier computer vision venues. We show that African publishing trends in top-tier venues over the years do not exhibit consistent growth, unlike other continents such as North America or Asia. Moreover, we study all computer vision publications beyond top-tier venues in different African regions to find that mainly Northern and Southern Africa are publishing in computer vision with 68.5% and 15.9% of publications, resp. Nonetheless, we highlight that both Eastern and Western Africa are exhibiting a promising increase with the last two years closing the gap with Southern Africa. Additionally, we study the collaboration patterns in these publications to find that most of these exhibit international collaborations rather than African ones. We also show that most of these publications include an African author that is a key contributor as the first or last author. Finally, we present the most recurring keywords in computer vision publications per African region.
△ Less
Submitted 4 February, 2024; v1 submitted 11 May, 2023;
originally announced May 2023.
-
MasakhaNEWS: News Topic Classification for African languages
Authors:
David Ifeoluwa Adelani,
Marek Masiak,
Israel Abebe Azime,
Jesujoba Alabi,
Atnafu Lambebo Tonja,
Christine Mwase,
Odunayo Ogundepo,
Bonaventure F. P. Dossou,
Akintunde Oladipo,
Doreen Nixdorf,
Chris Chinenye Emezue,
sana al-azzawi,
Blessing Sibanda,
Davis David,
Lolwethu Ndolela,
Jonathan Mukiibi,
Tunde Ajayi,
Tatiana Moteu,
Brian Odhiambo,
Abraham Owodunni,
Nnaemeka Obiefuna,
Muhidin Mohamed,
Shamsuddeen Hassan Muhammad,
Teshome Mulugeta Ababu,
Saheed Abdullahi Salahudeen
, et al. (40 additional authors not shown)
Abstract:
African languages are severely under-represented in NLP research due to lack of datasets covering several NLP tasks. While there are individual language specific datasets that are being expanded to different tasks, only a handful of NLP tasks (e.g. named entity recognition and machine translation) have standardized benchmark datasets covering several geographical and typologically-diverse African…
▽ More
African languages are severely under-represented in NLP research due to lack of datasets covering several NLP tasks. While there are individual language specific datasets that are being expanded to different tasks, only a handful of NLP tasks (e.g. named entity recognition and machine translation) have standardized benchmark datasets covering several geographical and typologically-diverse African languages. In this paper, we develop MasakhaNEWS -- a new benchmark dataset for news topic classification covering 16 languages widely spoken in Africa. We provide an evaluation of baseline models by training classical machine learning models and fine-tuning several language models. Furthermore, we explore several alternatives to full fine-tuning of language models that are better suited for zero-shot and few-shot learning such as cross-lingual parameter-efficient fine-tuning (like MAD-X), pattern exploiting training (PET), prompting language models (like ChatGPT), and prompt-free sentence transformer fine-tuning (SetFit and Cohere Embedding API). Our evaluation in zero-shot setting shows the potential of prompting ChatGPT for news topic classification in low-resource African languages, achieving an average performance of 70 F1 points without leveraging additional supervision like MAD-X. In few-shot setting, we show that with as little as 10 examples per label, we achieved more than 90\% (i.e. 86.0 F1 points) of the performance of full supervised training (92.6 F1 points) leveraging the PET approach.
△ Less
Submitted 20 September, 2023; v1 submitted 19 April, 2023;
originally announced April 2023.
-
Adapting to the Low-Resource Double-Bind: Investigating Low-Compute Methods on Low-Resource African Languages
Authors:
Colin Leong,
Herumb Shandilya,
Bonaventure F. P. Dossou,
Atnafu Lambebo Tonja,
Joel Mathew,
Abdul-Hakeem Omotayo,
Oreen Yousuf,
Zainab Akinjobi,
Chris Chinenye Emezue,
Shamsudeen Muhammad,
Steven Kolawole,
Younwoo Choi,
Tosin Adewumi
Abstract:
Many natural language processing (NLP) tasks make use of massively pre-trained language models, which are computationally expensive. However, access to high computational resources added to the issue of data scarcity of African languages constitutes a real barrier to research experiments on these languages. In this work, we explore the applicability of low-compute approaches such as language adapt…
▽ More
Many natural language processing (NLP) tasks make use of massively pre-trained language models, which are computationally expensive. However, access to high computational resources added to the issue of data scarcity of African languages constitutes a real barrier to research experiments on these languages. In this work, we explore the applicability of low-compute approaches such as language adapters in the context of this low-resource double-bind. We intend to answer the following question: do language adapters allow those who are doubly bound by data and compute to practically build useful models? Through fine-tuning experiments on African languages, we evaluate their effectiveness as cost-effective approaches to low-resource African NLP. Using solely free compute resources, our results show that language adapters achieve comparable performances to massive pre-trained language models which are heavy on computational resources. This opens the door to further experimentation and exploration on full-extent of language adapters capacities.
△ Less
Submitted 29 March, 2023;
originally announced March 2023.