Showing 1–50 of 82 results for author: Abdul-Mageed, M

Searching in archive cs.
  1. arXiv:2407.01257  [pdf, other]

    cs.CL cs.SD eess.AS

    uDistil-Whisper: Label-Free Data Filtering for Knowledge Distillation via Large-Scale Pseudo Labelling

    Authors: Abdul Waheed, Karima Kadaoui, Muhammad Abdul-Mageed

    Abstract: Recent work on distilling Whisper's knowledge into small models using pseudo-labels shows promising performance while reducing the size by up to 50%. This results in small, efficient, and dedicated models. However, a critical step of distillation from pseudo-labels involves filtering high-quality predictions and using only those during training. This step requires ground truth to compare and filt…

    Submitted 1 July, 2024; originally announced July 2024.

    Comments: Work in progress

  2. arXiv:2406.16751  [pdf, other]

    cs.CL cs.SD eess.AS

    Towards Zero-Shot Text-To-Speech for Arabic Dialects

    Authors: Khai Duy Doan, Abdul Waheed, Muhammad Abdul-Mageed

    Abstract: Zero-shot multi-speaker text-to-speech (ZS-TTS) systems have advanced for English; however, it still lags behind due to insufficient resources. We address this gap for Arabic, a language of more than 450 million native speakers, by first adapting a sizeable existing dataset to suit the needs of speech synthesis. Additionally, we employ a set of Arabic dialect identification models to explore the i…

    Submitted 25 June, 2024; v1 submitted 24 June, 2024; originally announced June 2024.

  3. arXiv:2406.09933  [pdf, other]

    cs.SD cs.AI cs.HC cs.LG

    What Does it Take to Generalize SER Model Across Datasets? A Comprehensive Benchmark

    Authors: Adham Ibrahim, Shady Shehata, Ajinkya Kulkarni, Mukhtar Mohamed, Muhammad Abdul-Mageed

    Abstract: Speech emotion recognition (SER) is essential for enhancing human-computer interaction in speech-based applications. Despite improvements in specific emotional datasets, there is still a research gap in SER's capability to generalize across real-world situations. In this paper, we investigate approaches to generalize the SER system across different emotion datasets. In particular, we incorporate 1…

    Submitted 14 June, 2024; originally announced June 2024.

    Comments: ACCEPTED AT INTERSPEECH 2024, GREECE

  4. arXiv:2406.04512  [pdf, other]

    cs.CL cs.SD eess.AS

    To Distill or Not to Distill? On the Robustness of Robust Knowledge Distillation

    Authors: Abdul Waheed, Karima Kadaoui, Muhammad Abdul-Mageed

    Abstract: Arabic is known to present unique challenges for Automatic Speech Recognition (ASR). On one hand, its rich linguistic diversity and wide range of dialects complicate the development of robust, inclusive models. On the other, current multilingual ASR models are compute-intensive and lack proper comprehensive evaluations. In light of these challenges, we distill knowledge from large teacher models i…

    Submitted 6 June, 2024; originally announced June 2024.

    Comments: Accepted at ACL'24 main

  5. arXiv:2405.11441  [pdf, other]

    cs.IR cs.CL

    EmbSum: Leveraging the Summarization Capabilities of Large Language Models for Content-Based Recommendations

    Authors: Chiyu Zhang, Yifei Sun, Minghao Wu, Jun Chen, Jie Lei, Muhammad Abdul-Mageed, Rong Jin, Angli Liu, Ji Zhu, Sem Park, Ning Yao, Bo Long

    Abstract: Content-based recommendation systems play a crucial role in delivering personalized content to users in the digital world. In this work, we introduce EmbSum, a novel framework that enables offline pre-computations of users and candidate items while capturing the interactions within the user engagement history. By utilizing the pretrained encoder-decoder model and poly-attention layers, EmbSum deri…

    Submitted 19 May, 2024; originally announced May 2024.

    Comments: Under review

  6. arXiv:2404.05943  [pdf, other]

    cs.CL cs.AI

    Interplay of Machine Translation, Diacritics, and Diacritization

    Authors: Wei-Rui Chen, Ife Adebara, Muhammad Abdul-Mageed

    Abstract: We investigate two research questions: (1) how do machine translation (MT) and diacritization influence the performance of each other in a multi-task learning setting, and (2) the effect of keeping (vs. removing) diacritics on MT performance. We examine these two questions in both high-resource (HR) and low-resource (LR) settings across 55 different languages (36 African languages and 19 European langu…

    Submitted 8 April, 2024; originally announced April 2024.

    Comments: Accepted to NAACL 2024 Main Conference

  7. arXiv:2403.01106  [pdf, other]

    cs.CL cs.AI

    Distilling Text Style Transfer With Self-Explanation From LLMs

    Authors: Chiyu Zhang, Honglong Cai, Yuezhang Li, Yuexin Wu, Le Hou, Muhammad Abdul-Mageed

    Abstract: Text Style Transfer (TST) seeks to alter the style of text while retaining its core content. Given the constraints of limited parallel datasets for TST, we propose CoTeX, a framework that leverages large language models (LLMs) alongside chain-of-thought (CoT) prompting to facilitate TST. CoTeX distills the complex rewriting and reasoning capabilities of LLMs into more streamlined models capable of…

    Submitted 4 May, 2024; v1 submitted 2 March, 2024; originally announced March 2024.

    Comments: Accepted by NAACL Student Research Workshop 2024

  8. arXiv:2403.01031  [pdf, other]

    cs.CL cs.AI

    Peacock: A Family of Arabic Multimodal Large Language Models and Benchmarks

    Authors: Fakhraddin Alwajih, El Moatez Billah Nagoudi, Gagan Bhatia, Abdelrahman Mohamed, Muhammad Abdul-Mageed

    Abstract: Multimodal large language models (MLLMs) have proven effective in a wide range of tasks requiring complex reasoning and linguistic comprehension. However, due to a lack of high-quality multimodal resources in languages other than English, success of MLLMs remains relatively limited to English-based settings. This poses significant challenges in developing comparable models for other languages, inc…

    Submitted 24 May, 2024; v1 submitted 1 March, 2024; originally announced March 2024.

  9. arXiv:2402.15951  [pdf, other]

    cs.LG cs.CL cs.CY

    GreenLLaMA: A Framework for Detoxification with Explanations

    Authors: Md Tawkat Islam Khondaker, Muhammad Abdul-Mageed, Laks V. S. Lakshmanan

    Abstract: Prior works on detoxification are scattered in the sense that they do not cover all aspects of detoxification needed in a real-world scenario. Notably, prior works restrict the task of developing detoxification models to only a seen subset of platforms, leaving the question of how the models would perform on unseen platforms unexplored. Additionally, these works do not address non-detoxifiability,…

    Submitted 24 February, 2024; originally announced February 2024.

    Comments: 24 pages

  10. arXiv:2402.10986  [pdf, other]

    cs.CL cs.AI

    FinTral: A Family of GPT-4 Level Multimodal Financial Large Language Models

    Authors: Gagan Bhatia, El Moatez Billah Nagoudi, Hasan Cavusoglu, Muhammad Abdul-Mageed

    Abstract: We introduce FinTral, a suite of state-of-the-art multimodal large language models (LLMs) built upon the Mistral-7b model and tailored for financial analysis. FinTral integrates textual, numerical, tabular, and image data. We enhance FinTral with domain-specific pretraining, instruction fine-tuning, and RLAIF training by exploiting a large collection of textual and visual datasets we curate for th…

    Submitted 14 June, 2024; v1 submitted 16 February, 2024; originally announced February 2024.

  11. arXiv:2402.10555  [pdf, other]

    cs.IR cs.CL

    SPAR: Personalized Content-Based Recommendation via Long Engagement Attention

    Authors: Chiyu Zhang, Yifei Sun, Jun Chen, Jie Lei, Muhammad Abdul-Mageed, Sinong Wang, Rong Jin, Sem Park, Ning Yao, Bo Long

    Abstract: Leveraging users' long engagement histories is essential for personalized content recommendations. The success of pretrained language models (PLMs) in NLP has led to their use in encoding user histories and candidate items, framing content recommendations as textual semantic matching tasks. However, existing works still struggle with processing very long user historical text and insufficient user-…

    Submitted 21 May, 2024; v1 submitted 16 February, 2024; originally announced February 2024.

    Comments: Under review

  12. arXiv:2401.01053  [pdf, other]

    cs.CL

    Cheetah: Natural Language Generation for 517 African Languages

    Authors: Ife Adebara, AbdelRahim Elmadany, Muhammad Abdul-Mageed

    Abstract: Low-resource African languages pose unique challenges for natural language processing (NLP) tasks, including natural language generation (NLG). In this paper, we develop Cheetah, a massively multilingual NLG language model for African languages. Cheetah supports 517 African languages and language varieties, allowing us to address the scarcity of NLG resources and provide a solution to foster lingu…

    Submitted 10 January, 2024; v1 submitted 2 January, 2024; originally announced January 2024.

  13. arXiv:2312.08400  [pdf, other]

    cs.CL cs.AI

    Beyond English: Evaluating LLMs for Arabic Grammatical Error Correction

    Authors: Sang Yun Kwon, Gagan Bhatia, El Moatez Billah Nagoudi, Muhammad Abdul-Mageed

    Abstract: Large language models (LLMs) finetuned to follow human instruction have recently exhibited significant capabilities in various English NLP tasks. However, their performance in grammatical error correction (GEC), especially on languages other than English, remains significantly unexplored. In this work, we evaluate the abilities of instruction finetuned LLMs in Arabic GEC, a complex task due to Ara…

    Submitted 13 December, 2023; originally announced December 2023.

    Comments: arXiv admin note: text overlap with arXiv:2308.04492

  14. arXiv:2312.01536  [pdf, other]

    cs.CV

    CalliPaint: Chinese Calligraphy Inpainting with Diffusion Model

    Authors: Qisheng Liao, Zhinuo Wang, Muhammad Abdul-Mageed, Gus Xia

    Abstract: Chinese calligraphy can be viewed as a unique form of visual art. Recent advancements in computer vision hold significant potential for the future development of generative models in the realm of Chinese calligraphy. Nevertheless, methods of Chinese calligraphy inpainting, which can be effectively used in the art and education fields, remain relatively unexplored. In this paper, we introduce a new…

    Submitted 3 December, 2023; originally announced December 2023.

    Comments: Accepted as a Machine Learning for Creativity and Design (ML4CD) workshop paper at NeurIPS 2023. https://neurips.cc/virtual/2023/workshop/66545#wse-detail-75063

  15. arXiv:2311.09696  [pdf, other]

    cs.CL

    Fumbling in Babel: An Investigation into ChatGPT's Language Identification Ability

    Authors: Wei-Rui Chen, Ife Adebara, Khai Duy Doan, Qisheng Liao, Muhammad Abdul-Mageed

    Abstract: ChatGPT has recently emerged as a powerful NLP tool that can carry out a variety of tasks. However, the range of languages ChatGPT can handle remains largely a mystery. To uncover which languages ChatGPT 'knows', we investigate its language identification (LID) abilities. For this purpose, we compile Babel-670, a benchmark comprising 670 languages representing 24 language families spoken in five c…

    Submitted 8 April, 2024; v1 submitted 16 November, 2023; originally announced November 2023.

    Comments: Accepted to NAACL 2024 Findings

  16. arXiv:2311.08844  [pdf, other]

    cs.CV cs.CL

    Violet: A Vision-Language Model for Arabic Image Captioning with Gemini Decoder

    Authors: Abdelrahman Mohamed, Fakhraddin Alwajih, El Moatez Billah Nagoudi, Alcides Alcoba Inciarte, Muhammad Abdul-Mageed

    Abstract: Although image captioning has a vast array of applications, it has not reached its full potential in languages other than English. Arabic, for instance, although the native language of more than 400 million people, remains largely underrepresented in this area. This is due to the lack of labeled data and powerful Arabic generative models. We alleviate this issue by presenting a novel vision-langua…

    Submitted 15 November, 2023; originally announced November 2023.

    Comments: Accepted in ArabicNLP Conference

  17. arXiv:2310.18778  [pdf, other]

    cs.CL

    ProMap: Effective Bilingual Lexicon Induction via Language Model Prompting

    Authors: Abdellah El Mekki, Muhammad Abdul-Mageed, El Moatez Billah Nagoudi, Ismail Berrada, Ahmed Khoumsi

    Abstract: Bilingual Lexicon Induction (BLI), where words are translated between two languages, is an important NLP task. While noticeable progress on BLI in rich resource languages using static word embeddings has been achieved, the word translation performance can be further improved by incorporating information from contextualized word embeddings. In this paper, we introduce ProMap, a novel approach for B…

    Submitted 28 October, 2023; originally announced October 2023.

    Comments: To appear in IJCNLP-AACL 2023

  18. arXiv:2310.17333  [pdf, other]

    cs.CL

    Arabic Fine-Grained Entity Recognition

    Authors: Haneen Liqreina, Mustafa Jarrar, Mohammed Khalilia, Ahmed Oumar El-Shangiti, Muhammad Abdul-Mageed

    Abstract: Traditional NER systems are typically trained to recognize coarse-grained entities, and less attention is given to classifying entities into a hierarchy of fine-grained lower-level subtypes. This article aims to advance Arabic NER with fine-grained entities. We chose to extend Wojood (an open-source Nested Arabic Named Entity Corpus) with subtypes. In particular, four main entity types in Wojood,…

    Submitted 18 December, 2023; v1 submitted 26 October, 2023; originally announced October 2023.

  19. arXiv:2310.16712  [pdf, other]

    cs.CL

    LLM Performance Predictors are good initializers for Architecture Search

    Authors: Ganesh Jawahar, Muhammad Abdul-Mageed, Laks V. S. Lakshmanan, Dujian Ding

    Abstract: Large language models (LLMs) have become an integral component in solving a wide range of NLP tasks. In this work, we explore a novel use case of using LLMs to build performance predictors (PP): models that, given a specific deep neural network architecture, predict its performance on a downstream task. We design PP prompts for LLMs consisting of: (i) role: description of the role assigned to the…

    Submitted 25 October, 2023; originally announced October 2023.

  20. arXiv:2310.16153  [pdf, other]

    cs.CL

    WojoodNER 2023: The First Arabic Named Entity Recognition Shared Task

    Authors: Mustafa Jarrar, Muhammad Abdul-Mageed, Mohammed Khalilia, Bashar Talafha, AbdelRahim Elmadany, Nagham Hamad, Alaa' Omar

    Abstract: We present WojoodNER-2023, the first Arabic Named Entity Recognition (NER) Shared Task. The primary focus of WojoodNER-2023 is on Arabic NER, offering novel NER datasets (i.e., Wojood) and the definition of subtasks designed to facilitate meaningful comparisons between different NER approaches. WojoodNER-2023 encompassed two Subtasks: FlatNER and NestedNER. A total of 45 unique teams registered fo…

    Submitted 24 October, 2023; originally announced October 2023.

  21. arXiv:2310.16127  [pdf, other]

    cs.CL

    Octopus: A Multitask Model and Toolkit for Arabic Natural Language Generation

    Authors: AbdelRahim Elmadany, El Moatez Billah Nagoudi, Muhammad Abdul-Mageed

    Abstract: Understanding Arabic text and generating human-like responses is a challenging endeavor. While many researchers have proposed models and solutions for individual problems, there is an acute shortage of a comprehensive Arabic natural language generation toolkit that is capable of handling a wide range of tasks. In this work, we present a novel Arabic text-to-text Transformer model, namely AraT5v2.…

    Submitted 24 October, 2023; originally announced October 2023.

  22. arXiv:2310.16117  [pdf, other]

    cs.CL

    NADI 2023: The Fourth Nuanced Arabic Dialect Identification Shared Task

    Authors: Muhammad Abdul-Mageed, AbdelRahim Elmadany, Chiyu Zhang, El Moatez Billah Nagoudi, Houda Bouamor, Nizar Habash

    Abstract: We describe the findings of the fourth Nuanced Arabic Dialect Identification Shared Task (NADI 2023). The objective of NADI is to help advance state-of-the-art Arabic NLP by creating opportunities for teams of researchers to collaboratively compete under standardized conditions. It does so with a focus on Arabic dialects, offering novel datasets and defining subtasks that allow for meaningful comp…

    Submitted 24 October, 2023; originally announced October 2023.

    Comments: arXiv admin note: text overlap with arXiv:2210.09582

  23. arXiv:2310.14557  [pdf, other]

    cs.CL

    The Skipped Beat: A Study of Sociopragmatic Understanding in LLMs for 64 Languages

    Authors: Chiyu Zhang, Khai Duy Doan, Qisheng Liao, Muhammad Abdul-Mageed

    Abstract: Instruction tuned large language models (LLMs), such as ChatGPT, demonstrate remarkable performance in a wide range of tasks. Despite numerous recent studies that examine the performance of instruction-tuned LLMs on various NLP benchmarks, there remains a lack of comprehensive investigation into their ability to understand cross-lingual sociopragmatic meaning (SM), i.e., meaning embedded within so…

    Submitted 23 October, 2023; originally announced October 2023.

    Comments: Accepted by EMNLP 2023 Main conference

  24. arXiv:2310.11069  [pdf, other]

    cs.CL cs.SD eess.AS

    VoxArabica: A Robust Dialect-Aware Arabic Speech Recognition System

    Authors: Abdul Waheed, Bashar Talafha, Peter Sullivan, AbdelRahim Elmadany, Muhammad Abdul-Mageed

    Abstract: Arabic is a complex language with many varieties and dialects spoken by over 450 million people all around the world. Due to the linguistic diversity and variations, it is challenging to build a robust and generalized ASR system for Arabic. In this work, we address this gap by developing and demoing a system, dubbed VoxArabica, for dialect identification (DID) as well as automatic speech recognition (AS…

    Submitted 27 October, 2023; v1 submitted 17 October, 2023; originally announced October 2023.

    Comments: Accepted at ArabicNLP conference co-located with EMNLP'23. First three authors contributed equally

  25. arXiv:2308.04492  [pdf, other]

    cs.AI

    ChatGPT for Arabic Grammatical Error Correction

    Authors: Sang Yun Kwon, Gagan Bhatia, El Moatez Billah Nagoudi, Muhammad Abdul-Mageed

    Abstract: Recently, large language models (LLMs) fine-tuned to follow human instruction have exhibited significant capabilities in various English NLP tasks. However, their performance in grammatical error correction (GEC) tasks, particularly in non-English languages, remains significantly unexplored. In this paper, we delve into the abilities of instruction fine-tuned LLMs in Arabic GEC, a task made complex du…

    Submitted 8 August, 2023; originally announced August 2023.

  26. arXiv:2308.03051  [pdf, other]

    cs.CL cs.LG

    TARJAMAT: Evaluation of Bard and ChatGPT on Machine Translation of Ten Arabic Varieties

    Authors: Karima Kadaoui, Samar M. Magdy, Abdul Waheed, Md Tawkat Islam Khondaker, Ahmed Oumar El-Shangiti, El Moatez Billah Nagoudi, Muhammad Abdul-Mageed

    Abstract: Despite the purported multilingual proficiency of instruction-finetuned large language models (LLMs) such as ChatGPT and Bard, the linguistic inclusivity of these models remains insufficiently explored. Considering this constraint, we present a thorough assessment of Bard and ChatGPT (encompassing both GPT-3.5 and GPT-4) regarding their machine translation proficiencies across ten varieties of Ara…

    Submitted 23 October, 2023; v1 submitted 6 August, 2023; originally announced August 2023.

    Comments: ArabicNLP 2023

  27. arXiv:2306.04845  [pdf, other]

    cs.CL

    Mixture-of-Supernets: Improving Weight-Sharing Supernet Training with Architecture-Routed Mixture-of-Experts

    Authors: Ganesh Jawahar, Haichuan Yang, Yunyang Xiong, Zechun Liu, Dilin Wang, Fei Sun, Meng Li, Aasish Pappu, Barlas Oguz, Muhammad Abdul-Mageed, Laks V. S. Lakshmanan, Raghuraman Krishnamoorthi, Vikas Chandra

    Abstract: Weight-sharing supernet has become a vital component for performance estimation in the state-of-the-art (SOTA) neural architecture search (NAS) frameworks. Although supernet can directly generate different subnetworks without retraining, there is no guarantee for the quality of these subnetworks because of weight sharing. In NLP tasks such as machine translation and pre-trained language modeling,…

    Submitted 7 June, 2023; originally announced June 2023.

  28. arXiv:2306.03789  [pdf, other]

    eess.AS cs.CL cs.LG

    On the Robustness of Arabic Speech Dialect Identification

    Authors: Peter Sullivan, AbdelRahim Elmadany, Muhammad Abdul-Mageed

    Abstract: Arabic dialect identification (ADI) tools are an important part of the large-scale data collection pipelines necessary for training speech recognition models. As these pipelines require application of ADI tools to potentially out-of-domain data, we aim to investigate how vulnerable the tools may be to this domain shift. With self-supervised learning (SSL) models as a starting point, we evaluate tr…

    Submitted 1 June, 2023; originally announced June 2023.

  29. arXiv:2306.02902  [pdf, ps, other]

    cs.CL cs.SD eess.AS

    N-Shot Benchmarking of Whisper on Diverse Arabic Speech Recognition

    Authors: Bashar Talafha, Abdul Waheed, Muhammad Abdul-Mageed

    Abstract: Whisper, the recently developed multilingual weakly supervised model, is reported to perform well on multiple speech recognition benchmarks in both monolingual and multilingual settings. However, it is not clear how Whisper would fare under diverse conditions even on languages it was evaluated on such as Arabic. In this work, we address this gap by comprehensively evaluating Whisper on several var…

    Submitted 5 June, 2023; originally announced June 2023.

    Comments: 4 pages, INTERSPEECH 2023

  30. arXiv:2305.14989  [pdf, other]

    cs.CL

    Dolphin: A Challenging and Diverse Benchmark for Arabic NLG

    Authors: El Moatez Billah Nagoudi, AbdelRahim Elmadany, Ahmed El-Shangiti, Muhammad Abdul-Mageed

    Abstract: We present Dolphin, a novel benchmark that addresses the need for a natural language generation (NLG) evaluation framework dedicated to the wide collection of Arabic languages and varieties. The proposed benchmark encompasses a broad range of 13 different NLG tasks, including dialogue generation, question answering, machine translation, summarization, among others. Dolphin comprises a substantial…

    Submitted 24 October, 2023; v1 submitted 24 May, 2023; originally announced May 2023.

  31. arXiv:2305.14976  [pdf, other]

    cs.CL cs.LG

    GPTAraEval: A Comprehensive Evaluation of ChatGPT on Arabic NLP

    Authors: Md Tawkat Islam Khondaker, Abdul Waheed, El Moatez Billah Nagoudi, Muhammad Abdul-Mageed

    Abstract: ChatGPT's emergence heralds a transformative phase in NLP, particularly demonstrated through its excellent performance on many English benchmarks. However, the model's efficacy across diverse linguistic contexts remains largely uncharted territory. This work aims to bridge this knowledge gap, with a primary focus on assessing ChatGPT's capabilities on Arabic languages and dialectal varieties. Our…

    Submitted 21 October, 2023; v1 submitted 24 May, 2023; originally announced May 2023.

    Comments: EMNLP 2023 Main Conference

  32. arXiv:2304.14402  [pdf, other]

    cs.CL

    LaMini-LM: A Diverse Herd of Distilled Models from Large-Scale Instructions

    Authors: Minghao Wu, Abdul Waheed, Chiyu Zhang, Muhammad Abdul-Mageed, Alham Fikri Aji

    Abstract: Large language models (LLMs) with instruction fine-tuning demonstrate superior generative capabilities. However, these models are resource-intensive. To alleviate this issue, we explore distilling knowledge from instruction-tuned LLMs into much smaller ones. To this end, we carefully develop a large set of 2.58M instructions based on both existing and newly-generated instructions. In addition to b…

    Submitted 28 January, 2024; v1 submitted 27 April, 2023; originally announced April 2023.

    Comments: 21 pages, 8 figures, 17 tables, accepted by EACL2024 main conference

  33. arXiv:2304.13292  [pdf, other]

    cs.CL

    Zero-Shot Slot and Intent Detection in Low-Resource Languages

    Authors: Sang Yun Kwon, Gagan Bhatia, El Moatez Billah Nagoudi, Alcides Alcoba Inciarte, Muhammad Abdul-Mageed

    Abstract: Intent detection and slot filling are critical tasks in spoken and natural language understanding for task-oriented dialog systems. In this work we describe our participation in the slot and intent detection for low-resource language varieties (SID4LR; Aepli et al. (2023)). We investigate the slot and intent detection (SID) tasks using a wide range of models and settings. Given the recent success…

    Submitted 26 April, 2023; originally announced April 2023.

    Comments: VarDial @ EACL

  34. arXiv:2304.11256  [pdf, other]

    cs.CL

    UBC-DLNLP at SemEval-2023 Task 12: Impact of Transfer Learning on African Sentiment Analysis

    Authors: Gagan Bhatia, Ife Adebara, AbdelRahim Elmadany, Muhammad Abdul-Mageed

    Abstract: We describe our contribution to the SemEval 2023 AfriSenti-SemEval shared task, where we tackle the task of sentiment analysis in 14 different African languages. We develop both monolingual and multilingual models under a fully supervised setting (subtasks A and B). We also develop models for the zero-shot setting (subtask C). Our approach involves experimenting with transfer learning using six lan…

    Submitted 25 April, 2023; v1 submitted 21 April, 2023; originally announced April 2023.

    Comments: AfriSenti 2023 @ ACL 2023

  35. arXiv:2212.10785  [pdf, other]

    cs.CL cs.AI

    SERENGETI: Massively Multilingual Language Models for Africa

    Authors: Ife Adebara, AbdelRahim Elmadany, Muhammad Abdul-Mageed, Alcides Alcoba Inciarte

    Abstract: Multilingual pretrained language models (mPLMs) acquire valuable, generalizable linguistic information during pretraining and have advanced the state of the art on task-specific finetuning. To date, only ~31 out of ~2,000 African languages are covered in existing language models. We ameliorate this limitation by developing SERENGETI, a massively multilingual language model that covers 517 African…

    Submitted 26 May, 2023; v1 submitted 21 December, 2022; originally announced December 2022.

    Comments: To appear in Findings of ACL 2023

  36. arXiv:2212.10758  [pdf, other]

    cs.CL cs.AI

    ORCA: A Challenging Benchmark for Arabic Language Understanding

    Authors: AbdelRahim Elmadany, El Moatez Billah Nagoudi, Muhammad Abdul-Mageed

    Abstract: Due to their crucial role in all NLP, several benchmarks have been proposed to evaluate pretrained language models. In spite of these efforts, no public benchmark of diverse nature currently exists for evaluation of Arabic. This makes it challenging to measure progress for both Arabic and multilingual language models. This challenge is compounded by the fact that any benchmark targeting Arabic nee…

    Submitted 29 May, 2023; v1 submitted 20 December, 2022; originally announced December 2022.

    Comments: All authors contributed equally. Accepted at ACL 2023, Toronto, Canada

  37. arXiv:2212.10755  [pdf, other]

    cs.CL

    JASMINE: Arabic GPT Models for Few-Shot Learning

    Authors: El Moatez Billah Nagoudi, Muhammad Abdul-Mageed, AbdelRahim Elmadany, Alcides Alcoba Inciarte, Md Tawkat Islam Khondaker

    Abstract: Scholarship on generative pretraining (GPT) remains acutely Anglocentric, leaving serious gaps in our understanding of the whole class of autoregressive models. For example, we have little knowledge about the potential of these models and their societal impacts in diverse linguistic and cultural settings. We alleviate this issue for Arabic, a wide collection of languages and dialectal varieties wi…

    Submitted 24 October, 2023; v1 submitted 20 December, 2022; originally announced December 2022.

  38. arXiv:2211.06452  [pdf, other]

    cs.CL cs.LG

    Cross-Platform and Cross-Domain Abusive Language Detection with Supervised Contrastive Learning

    Authors: Md Tawkat Islam Khondaker, Muhammad Abdul-Mageed, Laks V. S. Lakshmanan

    Abstract: The prevalence of abusive language on different online platforms has been a major concern that raises the need for automated cross-platform abusive language detection. However, prior works focus on concatenating data from multiple platforms, inherently adopting the Empirical Risk Minimization (ERM) method. In this work, we address this challenge from the perspective of a domain generalization objective.…

    Submitted 11 November, 2022; originally announced November 2022.

  39. arXiv:2210.12314  [pdf, other]

    cs.CL

    A Benchmark Study of Contrastive Learning for Arabic Social Meaning

    Authors: Md Tawkat Islam Khondaker, El Moatez Billah Nagoudi, AbdelRahim Elmadany, Muhammad Abdul-Mageed, Laks V. S. Lakshmanan

    Abstract: Contrastive learning (CL) brought significant progress to various NLP tasks. Despite this progress, CL has not been applied to Arabic NLP to date. Nor is it clear how much benefit it could bring to particular classes of tasks such as those involved in Arabic social meaning (e.g., sentiment analysis, dialect identification, hate speech detection). In this work, we present a comprehensive benchmark…

    Submitted 21 October, 2022; originally announced October 2022.

  40. arXiv:2210.11744  [pdf, other]

    cs.CL cs.LG

    AfroLID: A Neural Language Identification Tool for African Languages

    Authors: Ife Adebara, AbdelRahim Elmadany, Muhammad Abdul-Mageed, Alcides Alcoba Inciarte

    Abstract: Language identification (LID) is a crucial precursor for NLP, especially for mining web data. Problematically, most of the world's 7000+ languages today are not covered by LID technologies. We address this pressing issue for Africa by introducing AfroLID, a neural LID toolkit for 517 African languages and varieties. AfroLID exploits a multi-domain web dataset manually curated from across 14 lang…

    Submitted 6 December, 2022; v1 submitted 21 October, 2022; originally announced October 2022.

    Comments: To appear at EMNLP 2022 Main conference

  41. arXiv:2210.09582  [pdf, other]

    cs.CL

    NADI 2022: The Third Nuanced Arabic Dialect Identification Shared Task

    Authors: Muhammad Abdul-Mageed, Chiyu Zhang, AbdelRahim Elmadany, Houda Bouamor, Nizar Habash

    Abstract: We describe findings of the third Nuanced Arabic Dialect Identification Shared Task (NADI 2022). NADI aims at advancing state of the art Arabic NLP, including on Arabic dialects. It does so by affording diverse datasets and modeling opportunities in a standardized context where meaningful comparisons between models and approaches are possible. NADI 2022 targeted both dialect identification (Subtas…

    Submitted 20 October, 2022; v1 submitted 18 October, 2022; originally announced October 2022.

    Comments: arXiv admin note: text overlap with arXiv:2103.08466

  42. arXiv:2210.07535  [pdf, other]

    cs.CL cs.LG

    AutoMoE: Heterogeneous Mixture-of-Experts with Adaptive Computation for Efficient Neural Machine Translation

    Authors: Ganesh Jawahar, Subhabrata Mukherjee, Xiaodong Liu, Young Jin Kim, Muhammad Abdul-Mageed, Laks V. S. Lakshmanan, Ahmed Hassan Awadallah, Sebastien Bubeck, Jianfeng Gao

    Abstract: Mixture-of-Expert (MoE) models have obtained state-of-the-art performance in Neural Machine Translation (NMT) tasks. Existing works in MoE mostly consider a homogeneous design where the same number of experts of the same size are placed uniformly throughout the network. Furthermore, existing MoE works do not consider computational constraints (e.g., FLOPs, latency) to guide their design. To this e…

    Submitted 7 June, 2023; v1 submitted 14 October, 2022; originally announced October 2022.

    Comments: ACL 2023 Findings

  43. arXiv:2210.03251  [pdf, other]

    cs.CL

    Small Character Models Match Large Word Models for Autocomplete Under Memory Constraints

    Authors: Ganesh Jawahar, Subhabrata Mukherjee, Debadeepta Dey, Muhammad Abdul-Mageed, Laks V. S. Lakshmanan, Caio Cesar Teodoro Mendes, Gustavo Henrique de Rosa, Shital Shah

    Abstract: Autocomplete is a task where the user inputs a piece of text, termed prompt, which is conditioned by the model to generate semantically coherent continuation. Existing works for this task have primarily focused on datasets (e.g., email, chat) with high frequency user prompt patterns (or focused prompts) where word-based language models have been quite effective. In this work, we study the more cha…

    Submitted 7 June, 2023; v1 submitted 6 October, 2022; originally announced October 2022.

    Comments: SustaiNLP 2023

  44. arXiv:2206.03933  [pdf, other]

    cs.CL cs.AI cs.LG

    TURJUMAN: A Public Toolkit for Neural Arabic Machine Translation

    Authors: El Moatez Billah Nagoudi, AbdelRahim Elmadany, Muhammad Abdul-Mageed

    Abstract: We present TURJUMAN, a neural toolkit for translating from 20 languages into Modern Standard Arabic (MSA). TURJUMAN exploits the recently-introduced text-to-text Transformer AraT5 model, endowing it with a powerful ability to decode into Arabic. The toolkit offers the possibility of employing a number of diverse decoding methods, making it suited for acquiring paraphrases for the MSA translations…

    Submitted 27 May, 2022; originally announced June 2022.

    Comments: All authors contributed equally

    Journal ref: Proceedings of the 5th Workshop on Open-Source Arabic Corpora and Processing Tools (OSACT5), 2022

  45. arXiv:2205.06993  [pdf, other]

    cs.CL

    Improving Neural Machine Translation of Indigenous Languages with Multilingual Transfer Learning

    Authors: Wei-Rui Chen, Muhammad Abdul-Mageed

    Abstract: Machine translation (MT) involving Indigenous languages, including those possibly endangered, is challenging due to lack of sufficient parallel data. We describe an approach exploiting bilingual and multilingual pretrained MT models in a transfer learning setting to translate from Spanish to ten South American Indigenous languages. Our models set new SOTA on five out of the ten language pairs we c…

    Submitted 14 May, 2022; originally announced May 2022.

  46. arXiv:2204.04611  [pdf, other]

    cs.CL cs.AI

    Decay No More: A Persistent Twitter Dataset for Learning Social Meaning

    Authors: Chiyu Zhang, Muhammad Abdul-Mageed, El Moatez Billah Nagoudi

    Abstract: With the proliferation of social media, many studies resort to social media to construct datasets for developing social meaning understanding systems. For the popular case of Twitter, most researchers distribute tweet IDs without the actual text contents due to the data distribution policy of the platform. One issue is that the posts become increasingly inaccessible over time, which leads to unfai…

    Submitted 7 May, 2022; v1 submitted 10 April, 2022; originally announced April 2022.

    Comments: 1st Workshop on Novel Evaluation Approaches for Text Classification Systems on Social Media (NEATCLasS) colocated at ICWSM 2022. arXiv admin note: text overlap with arXiv:2108.00356

  47. arXiv:2203.10343  [pdf, other]

    cs.CL cs.AI

    Automatic Detection of Entity-Manipulated Text using Factual Knowledge

    Authors: Ganesh Jawahar, Muhammad Abdul-Mageed, Laks V. S. Lakshmanan

    Abstract: In this work, we focus on the problem of distinguishing a human written news article from a news article that is created by manipulating entities in a human written news article (e.g., replacing entities with factually incorrect entities). Such manipulated articles can mislead the reader by posing as a human written news article. We propose a neural network based detector that detects manipulated…

    Submitted 19 March, 2022; originally announced March 2022.

    Comments: Association for Computational Linguistics (ACL) 2022 camera-ready

  48. arXiv:2203.08351  [pdf, other]

    cs.CL cs.AI

    Towards Afrocentric NLP for African Languages: Where We Are and Where We Can Go

    Authors: Ife Adebara, Muhammad Abdul-Mageed

    Abstract: Aligning with ACL 2022 special Theme on "Language Diversity: from Low Resource to Endangered Languages", we discuss the major linguistic and sociopolitical challenges facing development of NLP technologies for African languages. Situating African languages in a typological framework, we discuss how the particulars of these languages can be harnessed. To facilitate future research, we also highligh…

    Submitted 17 March, 2022; v1 submitted 15 March, 2022; originally announced March 2022.

    Comments: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL 2022)

  49. arXiv:2203.07648  [pdf, other]

    cs.CL cs.AI

    Contrastive Learning of Sociopragmatic Meaning in Social Media

    Authors: Chiyu Zhang, Muhammad Abdul-Mageed, Ganesh Jawahar

    Abstract: Recent progress in representation and contrastive learning in NLP has not widely considered the class of sociopragmatic meaning (i.e., meaning in interaction within different language communities). To bridge this gap, we propose a novel framework for learning task-agnostic representations transferable to a wide range of sociopragmatic tasks (e.g., emotion, hate speech, humor, sarcasm). Ou…

    Submitted 24 May, 2023; v1 submitted 15 March, 2022; originally announced March 2022.

    Comments: Final camera-ready version for ACL2023

  50. arXiv:2202.05209  [pdf, other]

    cs.CL cs.SD eess.AS

    Improving Automatic Speech Recognition for Non-Native English with Transfer Learning and Language Model Decoding

    Authors: Peter Sullivan, Toshiko Shibano, Muhammad Abdul-Mageed

    Abstract: ASR systems designed for native English (L1) usually underperform on non-native English (L2). To address this performance gap, (i) we extend our previous work to investigate fine-tuning of a pre-trained wav2vec 2.0 model (Baevski et al., 2020; Xu et al., 2021) under a rich set of L1 and L2 training conditions. We further (ii) incorporate language model decoding in the ASR system, al…

    Submitted 10 February, 2022; originally announced February 2022.

    Comments: arXiv admin note: substantial text overlap with arXiv:2110.00678