Leveraging Corpus Metadata to Detect Template-based Translation: An Exploratory Case Study of the Egyptian Arabic Wikipedia Edition
Authors:
Saied Alshahrani,
Hesham Haroon,
Ali Elfilali,
Mariama Njie,
Jeanna Matthews
Abstract:
Wikipedia articles (content pages) are commonly used corpora in Natural Language Processing (NLP) research, especially in low-resource languages other than English. Yet a few research studies have examined the three Arabic Wikipedia editions, Arabic Wikipedia (AR), Egyptian Arabic Wikipedia (ARZ), and Moroccan Arabic Wikipedia (ARY), and documented issues in the Egyptian Arabic Wikipedia edition regarding the massive automatic creation of its articles using template-based translation from English to Arabic without human involvement. This has overwhelmed the Egyptian Arabic Wikipedia with articles that not only have low-quality content but also do not represent the Egyptian people, their culture, and their dialect. In this paper, we aim to mitigate the problem of template translation that occurred in the Egyptian Arabic Wikipedia by identifying these template-translated articles and their characteristics through exploratory analysis and by building automatic detection systems. We first explore the content of the three Arabic Wikipedia editions in terms of density, quality, and human contributions, and use the resulting insights to build multivariate machine learning classifiers that leverage articles' metadata to detect the template-translated articles automatically. We then publicly deploy and host the best-performing classifier, XGBoost, as an online application called EGYPTIAN WIKIPEDIA SCANNER and release the extracted, filtered, and labeled datasets so that the research community can benefit from both the datasets and the web-based detection system.
Submitted 31 March, 2024;
originally announced April 2024.
A method to identify potential ambiguous Malay words through Ambiguity Attributes mapping: An exploratory Study
Authors:
Hazlina Haron,
Abdul Azim Abd. Ghani
Abstract:
We describe a methodology for identifying a list of ambiguous Malay words that are commonly used in Malay documentation such as Requirement Specifications. We compiled several relevant requirement quality attributes and sentence rules from previous literature and adapted them to produce a set of ambiguity attributes that best suit Malay words. The extracted potentially ambiguous Malay words are then mapped onto the constructed ambiguity attributes to confirm their vagueness. The list is then verified by expert Malay linguists. This paper aims to identify a list of potentially ambiguous words in Malay to help writers avoid vague words when documenting Malay Requirement Specifications as well as any other related Malay documentation. The result of this study is a list of 120 potentially ambiguous Malay words that could act as a guideline in writing Malay sentences.
Submitted 26 February, 2014;
originally announced February 2014.