University of Cape Town's WMT22 System: Multilingual Machine Translation for Southern African Languages
Authors:
Khalid N. Elmadani,
Francois Meyer,
Jan Buys
Abstract:
The paper describes the University of Cape Town's submission to the constrained track of the WMT22 Shared Task: Large-Scale Machine Translation Evaluation for African Languages. Our system is a single multilingual translation model that translates between English and 8 South / South East African Languages, as well as between specific pairs of the African languages. We used several techniques suited for low-resource machine translation (MT), including overlap BPE, back-translation, synthetic training data generation, and adding more translation directions during training. Our results show the value of these techniques, especially for directions where very little or no bilingual training data is available.
Submitted 21 October, 2022;
originally announced October 2022.
Masader Plus: A New Interface for Exploring +500 Arabic NLP Datasets
Authors:
Yousef Altaher,
Ali Fadel,
Mazen Alotaibi,
Mazen Alyazidi,
Mishari Al-Mutairi,
Mutlaq Aldhbuiub,
Abdulrahman Mosaibah,
Abdelrahman Rezk,
Abdulrazzaq Alhendi,
Mazen Abo Shal,
Emad A. Alghamdi,
Maged S. Alshaibani,
Jezia Zakraoui,
Wafaa Mohammed,
Kamel Gaanoun,
Khalid N. Elmadani,
Mustafa Ghaleb,
Nouamane Tazi,
Raed Alharbi,
Maraim Masoud,
Zaid Alyafeai
Abstract:
Masader (Alyafeai et al., 2021) created a metadata structure to be used for cataloguing Arabic NLP datasets. However, developing an easy way to explore such a catalogue is a challenging task. In order to give the optimal experience for users and researchers exploring the catalogue, several design and user experience challenges must be resolved. Furthermore, user interactions with the website may provide an easy approach to improve the catalogue. In this paper, we introduce Masader Plus, a web interface for users to browse Masader. We demonstrate data exploration, filtration, and a simple API that allows users to examine datasets from the backend. Masader Plus can be explored using this link https://arbml.github.io/masader. A video recording explaining the interface can be found here https://www.youtube.com/watch?v=SEtdlSeqchk.
Submitted 1 August, 2022;
originally announced August 2022.
BERT Fine-tuning For Arabic Text Summarization
Authors:
Khalid N. Elmadani,
Mukhtar Elgezouli,
Anas Showk
Abstract:
Fine-tuning a pretrained BERT model is the state-of-the-art method for extractive/abstractive text summarization. In this paper, we showcase how this fine-tuning method can be applied to the Arabic language, both to construct the first documented model for abstractive Arabic text summarization and to show its performance in Arabic extractive summarization. Our model works with multilingual BERT (as the Arabic language does not have a pretrained BERT of its own). We show its performance on an English corpus first before applying it to Arabic corpora in both extractive and abstractive tasks.
Submitted 29 March, 2020;
originally announced April 2020.