-
Improving Natural Language Inference in Arabic using Transformer Models and Linguistically Informed Pre-Training
Authors:
Mohammad Majd Saad Al Deen,
Maren Pielka,
Jörn Hees,
Bouthaina Soulef Abdou,
Rafet Sifa
Abstract:
This paper addresses the classification of Arabic text data in the field of Natural Language Processing (NLP), with a particular focus on Natural Language Inference (NLI) and Contradiction Detection (CD). Arabic is considered a resource-poor language, meaning that few data sets are available, which in turn limits the availability of NLP methods. To overcome this limitation, we create a dedicated data set from publicly available resources. Subsequently, transformer-based machine learning models are trained and evaluated. We find that a language-specific model (AraBERT) performs competitively with state-of-the-art multilingual approaches when we apply linguistically informed pre-training methods such as Named Entity Recognition (NER). To our knowledge, this is the first large-scale evaluation for this task in Arabic, as well as the first application of multi-task pre-training in this context.
Submitted 27 July, 2023;
originally announced July 2023.
-
Improving Dialogue Act Classification for Spontaneous Arabic Speech and Instant Messages at Utterance Level
Authors:
AbdelRahim Elmadany,
Sherif Abdou,
Mervat Gheith
Abstract:
The ability to model and automatically detect dialogue acts is an important step toward understanding spontaneous speech and Instant Messages. However, it has been difficult to infer a dialogue act from a surface utterance because it depends heavily on the context of the utterance and the speaker's linguistic knowledge, especially in Arabic dialects. This paper proposes a statistical dialogue analysis model to recognize the dialogue acts of utterances using a multi-class hierarchical structure. The model can automatically acquire probabilistic discourse knowledge from a dialogue corpus that was collected and annotated manually from multi-genre Egyptian call centers. Extensive experiments were conducted using a Support Vector Machine classifier to evaluate the system performance. The results, with an average F-measure of 0.912, show that the proposed approach moderately improves the F-measure, by approximately 20%.
Submitted 30 May, 2018;
originally announced June 2018.
-
Towards Understanding Egyptian Arabic Dialogues
Authors:
Abdelrahim A Elmadany,
Sherif M Abdou,
Mervat Gheith
Abstract:
Labelling a user's utterances to understand their intents, which is called Dialogue Act (DA) classification, is considered a key component of the dialogue language understanding layer in automatic dialogue systems. In this paper, we propose a novel approach to labelling user utterances in Egyptian spontaneous dialogues and Instant Messages using a Machine Learning (ML) approach, without relying on any special lexicons, cues, or rules. Because no Egyptian dialect dialogue corpus is available, the system is evaluated on a multi-genre corpus of 4725 utterances from three domains, which were collected and annotated manually from Egyptian call centers. The system achieves an F1 score of 70.36% over all domains.
Submitted 13 July, 2015;
originally announced September 2015.
-
Idioms-Proverbs Lexicon for Modern Standard Arabic and Colloquial Sentiment Analysis
Authors:
Hossam S. Ibrahim,
Sherif M. Abdou,
Mervat Gheith
Abstract:
Despite a fair amount of work on sentiment analysis (SA) and opinion mining (OM) systems in the last decade, the performance of these systems still falls short of what is desired, especially for morphologically rich languages (MRLs) such as Arabic, due to the complexities and challenges inherent in the nature of the language itself. One of these challenges is the detection of idiom or proverb phrases within the writer's text or comment. An idiom or proverb is a form of speech or an expression that is peculiar to itself. Grammatically, it cannot be understood from the individual meanings of its elements, and it can yield a different sentiment when treated as separate words. Consequently, to facilitate the detection and classification of lexical phrases for automated SA systems, this paper presents AIPSeLEX, a novel idioms/proverbs sentiment lexicon for Modern Standard Arabic (MSA) and colloquial Arabic. AIPSeLEX is manually collected and annotated at sentence level with semantic orientation (positive or negative). The efforts of manually building and annotating the lexicon are reported. Moreover, we build a classifier that extracts idiom and proverb phrases from text using n-gram and similarity measure methods. Finally, several experiments were carried out on various data, including Arabic tweets and Arabic microblogs (hotel reservation, product reviews, and TV program comments) from publicly available Arabic online review websites (social media, blogs, forums, e-commerce web sites), to evaluate the coverage and accuracy of AIPSeLEX.
Submitted 5 June, 2015;
originally announced June 2015.
-
Arabic Inquiry-Answer Dialogue Acts Annotation Schema
Authors:
AbdelRahim A. Elmadany,
Sherif M. Abdou,
Mervat Gheith
Abstract:
We present an annotation schema as part of an effort to create a manually annotated corpus for Arabic dialogue language understanding, including spoken dialogue and written "chat" dialogue, for the inquiry-answer domain. The proposed schema mainly handles the request and response acts that occur frequently in inquiry-answer debate conversations expressing service requests, suggestions, and offers. We applied the proposed schema to 83 Arabic inquiry-answer dialogues.
Submitted 15 May, 2015;
originally announced May 2015.
-
Sentiment Analysis For Modern Standard Arabic And Colloquial
Authors:
Hossam S. Ibrahim,
Sherif M. Abdou,
Mervat Gheith
Abstract:
The rise of social media such as blogs and social networks has fueled interest in sentiment analysis. With the proliferation of reviews, ratings, recommendations and other forms of online expression, online opinion has turned into a kind of virtual currency for businesses looking to market their products, identify new opportunities and manage their reputations; many are therefore now looking to the field of sentiment analysis. In this paper, we present a feature-based, sentence-level approach for Arabic sentiment analysis. Our approach uses a lexicon of Arabic idioms and sayings as a key resource for improving the detection of sentiment polarity in Arabic sentences, together with a novel and rich set of linguistically motivated features (contextual intensifiers, contextual shifters, and negation handling) and syntactic features for conflicting phrases, which enhance the sentiment classification accuracy. Furthermore, we introduce an automatically expandable, wide-coverage polarity lexicon of Arabic sentiment words. The lexicon is built from a seed of gold-standard sentiment words that are manually collected and annotated; it expands and detects the sentiment orientation of new sentiment words automatically, using a synset aggregation technique and free online Arabic lexicons and thesauri. Our data focus on Modern Standard Arabic (MSA) and Egyptian dialectal Arabic tweets and microblogs (hotel reservation, product reviews, etc.). The experimental results using our resources and techniques with an SVM classifier indicate high performance levels, with accuracies of over 95%.
Submitted 12 May, 2015;
originally announced May 2015.
-
A Survey of Arabic Dialogues Understanding for Spontaneous Dialogues and Instant Message
Authors:
AbdelRahim A. Elmadany,
Sherif M. Abdou,
Mervat Gheith
Abstract:
Building dialogue systems has recently gained considerable attention, but most of the resources and systems built so far are tailored to English and other Indo-European languages. The need to design systems for other languages, such as Arabic, is increasing. For this reason, there is growing interest in the Arabic dialogue act classification task, because it is a key component of Arabic language understanding for building such systems. This paper surveys different techniques for dialogue act classification for Arabic. We describe the main existing techniques for utterance segmentation and classification, annotation schemas, and test corpora for Arabic dialogue understanding that have been introduced in the literature.
Submitted 12 May, 2015;
originally announced May 2015.
-
Turn Segmentation into Utterances for Arabic Spontaneous Dialogues and Instance Messages
Authors:
AbdelRahim A. Elmadany,
Sherif M. Abdou,
Mervat Gheith
Abstract:
Text segmentation is an essential processing task for many Natural Language Processing (NLP) applications, such as text summarization, text translation, and dialogue language understanding, among others. Turn segmentation is considered a key component of the dialogue understanding task for building automatic human-computer systems. In this paper, we introduce a novel approach to segmenting turns into utterances for Egyptian spontaneous dialogues and Instant Messages (IM) using a Machine Learning (ML) approach, as part of the task of automatically understanding Egyptian spontaneous dialogues and IM. Because no Egyptian dialect dialogue corpus is available, the system is evaluated on our own corpus of 3001 turns, which were collected, segmented, and annotated manually from Egyptian call centers. The system achieves an F1 score of 90.74% and an accuracy of 95.98%.
Submitted 12 May, 2015;
originally announced May 2015.
-
AltecOnDB: A Large-Vocabulary Arabic Online Handwriting Recognition Database
Authors:
Ibrahim Abdelaziz,
Sherif Abdou
Abstract:
Arabic is a Semitic language characterized by a complex and rich morphology. The exceptional degree of ambiguity in the writing system, the rich morphology, and the highly complex word formation process of roots and patterns all contribute to making computational approaches to Arabic very challenging. As a result, a practical handwriting recognition system should support a large vocabulary to provide high coverage and should use context information for disambiguation. Several research efforts have been devoted to building online Arabic handwriting recognition systems. Most of these efforts use either small private test data sets or a standard database with limited lexicon and coverage. A large-scale handwriting database is an essential resource that can advance research on online handwriting recognition. Currently, there is no online Arabic handwriting database with a large lexicon, high coverage, a large number of writers, and training/testing data.
In this paper, we introduce AltecOnDB, a large-scale online Arabic handwriting database. AltecOnDB has 98% coverage of all the possible PAWS of the Arabic language. The collected samples are complete sentences that include digits and punctuation marks. The collected data is available at the sentence, word, and character levels; hence, high-level linguistic models can be used for performance improvements. Data is collected from more than 1000 writers with different backgrounds, genders and ages. Annotation and verification tools are developed to facilitate the annotation and verification phases. We built an elementary recognition system to test our database and to show the difficulties that arise when handling a large vocabulary and dealing with large amounts of style variation in the collected data.
Submitted 24 December, 2014;
originally announced December 2014.
-
Large Vocabulary Arabic Online Handwriting Recognition System
Authors:
Ibrahim Abdelaziz,
Sherif Abdou,
Hassanin Al-Barhamtoshy
Abstract:
Arabic handwriting is consonantal and cursive. The analysis of Arabic script is further complicated by obligatory dots/strokes that are placed above or below most letters and are usually written with a delay. Due to ambiguities and the diversity of writing styles, recognition systems are generally based on a set of possible words called a lexicon. When the lexicon is small, recognition accuracy is more important, as the recognition time is minimal. On the other hand, both recognition speed and accuracy are critical when handling large lexicons. Arabic is rich in morphology and syntax, which makes its lexicon large. Therefore, a practical online handwriting recognition system should be able to handle a large lexicon with reasonable performance in terms of both accuracy and time. In this paper, we introduce a fully-fledged Hidden Markov Model (HMM) based system for Arabic online handwriting recognition that provides solutions for most of the difficulties inherent in recognizing the Arabic script. A new preprocessing technique for handling the delayed strokes is introduced. We use advanced modeling techniques for building our recognition system from the training data to provide a more detailed representation of the differences between the writing units, minimize the variance between writers in the training data, and obtain a better representation of the feature space. System results are enhanced using an additional post-processing step with a higher-order language model and cross-word HMM models. The system performance is evaluated using two different databases covering small and large lexicons. Our system outperforms the state-of-the-art systems for the small lexicon database. Furthermore, it shows promising results (accuracy and time) when supporting a large lexicon, with the possibility of adapting the models to specific writers to get even better results.
Submitted 17 October, 2015; v1 submitted 17 October, 2014;
originally announced October 2014.