Arabic Text Diacritization In The Age Of Transfer Learning: Token Classification Is All You Need

Abderrahman Skiredj
OCP Solutions and College of Computing
Mohammed VI Polytechnic University
[email protected]
& Ismail Berrada
College of Computing
Mohammed VI Polytechnic University
[email protected]

Abstract

Automatic diacritization of Arabic text involves adding diacritical marks (diacritics) to the text. This task poses a significant challenge with noteworthy implications for computational processing and comprehension. In this paper, we introduce PTCAD (Pre-FineTuned Token Classification for Arabic Diacritization), a novel two-phase approach for the Arabic Text Diacritization (ATD) task. PTCAD comprises a pre-finetuning phase and a finetuning phase, treating ATD as a token classification task for pre-trained models. The effectiveness of PTCAD is demonstrated through evaluations on two benchmark datasets derived from the Tashkeela dataset, where it achieves state-of-the-art results, including a 20% reduction in Word Error Rate (WER) compared to existing benchmarks and superior performance over GPT-4 on the ATD task.

1 Introduction

Natural Language Processing (NLP) has experienced marked success in recent years, reshaping a myriad of applications such as text analysis (Howard and Ruder, 2018), machine translation (Vaswani et al., 2017), Large Language Models (LLMs), and conversational models (OpenAI, 2023). These strides have not only heightened the automation and effectiveness of textual data processes but have also played a pivotal role in fields like education (Khan et al., 2023) and customer service (Al et al., 2022), among others. While the majority of these developments have been in the context of the English language, given its prevalent usage and vast data availability, the unique linguistic characteristics of other languages necessitate tailored approaches for comparable success.

In the realm of Arabic NLP, notable advancements have been achieved, encompassing dialect identification (El Mekki et al., 2020; El Mekki et al., 2021b), sentiment analysis (Mahdaouy et al., 2021), and domain adaptation (El Mekki et al., 2021a). Yet, Arabic, a Semitic language, possesses linguistic nuances that diverge greatly from English. A prime example of this is the role of diacritics in Arabic (Almanea, 2021). These diminutive marks, placed above or below characters, are indispensable for clarifying meaning and pronunciation. In Modern Standard Arabic (MSA) (Ryding, 2005), the frequent omission of diacritics introduces a notable level of ambiguity in written form. Such omissions pose challenges to NLP tasks, highlighting the importance of Arabic Text Diacritization (ATD) as a crucial task in Arabic NLP. ATD is significant for several reasons:

• Disambiguation. Arabic is typically written without diacritics, which are marks that indicate the vowels and pronunciation of words. The absence of diacritics can lead to ambiguity in the meaning of words, as many Arabic words share the same root consonants. Diacritization helps disambiguate these words, improving overall comprehension (Al-Hajj and Al-Rawi, 2004; Habash et al., 2007).

• NLP Applications. Many NLP applications, such as Text-To-Speech (TTS) generation, machine translation, reading comprehension, and Part-Of-Speech (POS) tagging, rely on accurately understanding the structure and meaning of text. Diacritization enhances the performance of these applications by providing more context and improving the accuracy of language processing (Hassan and Hassan, 2010; Abdul-Mageed et al., 2012).

• Accessibility. Diacritization makes Arabic text more accessible to learners and individuals who are not native speakers. It aids in pronunciation and comprehension, making it easier for non-Arabic speakers to understand and learn the language (Taji et al., 2014).

• Search and Information Retrieval. Diacritization can improve the accuracy of search and information retrieval systems. When diacritized, Arabic text becomes more searchable, helping users find relevant information more efficiently (Abdul-Mageed et al., 2012).

• Preservation of Cultural and Linguistic Heritage. Diacritization contributes to the preservation of the richness and subtleties of the Arabic language. It helps maintain the integrity of classical texts and allows for a more accurate representation of the language’s historical and cultural nuances (Taji et al., 2014).

Thus, Arabic Text Diacritization is crucial for enhancing language processing applications, improving accessibility for learners, facilitating accurate information retrieval, and preserving the linguistic and cultural heritage of the Arabic language (Al-Hajj and Al-Rawi, 2004; Habash et al., 2007; Hassan and Hassan, 2010; Abdul-Mageed et al., 2012; Taji et al., 2014).

Prior work has explored various approaches to address the ATD task, spanning from rule-based methods to classical machine learning models and contemporary deep learning architectures (Almanea, 2021), often integrated with pre-existing linguistic knowledge (Zalmout and Habash, 2020). Significant progress has been achieved in the diacritization of both Classical Arabic (CA) and MSA. For instance, in CA, Abbad and Xiong (2020) attained a Diacritic Error Rate (DER) of 3.39% and a Word Error Rate (WER) of 9.94% on the cleaned version of the Tashkeela dataset, using a multi-layer Recurrent Neural Network (RNN) in conjunction with rule-based and statistical correction components. Similarly, in MSA, Zalmout and Habash (2020) employed a joint modeling approach, achieving a diacritization accuracy of 93.9% on the Penn Arabic Treebank (PATB) dataset (Maamouri et al., 2004). Despite these advancements, a significant challenge persists: the complexity and richness of the Arabic language necessitate a deep contextual understanding to achieve accurate diacritization. Furthermore, existing large language models, such as ChatGPT (OpenAI, 2023), often struggle with the inherent ambiguity of undiacritized text. This challenge becomes particularly evident when models are trained on data that is inaccurate or inconsistent in quality and standards. Specifically, ChatGPT encounters difficulties in achieving precise diacritization and tends to demonstrate a lack of full comprehension when processing complex Arabic sentences devoid of diacritics.
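To make the two metrics concrete, the following is a minimal sketch of how DER and WER can be computed for a single reference/prediction pair. It assumes the common definitions (DER as the fraction of base characters whose diacritics are predicted incorrectly, WER as the fraction of words containing at least one such error) and counts every base character. Published evaluations differ in details such as whether undiacritized characters or case endings are counted, so the function names and counting choices here are illustrative assumptions rather than the exact scoring protocol of any cited work.

```python
# Arabic diacritic code points U+064B..U+0652:
# tanween (3), short vowels (3), shadda, sukun.
DIACRITICS = {chr(cp) for cp in range(0x064B, 0x0653)}


def split_labels(word):
    """Return (base_chars, labels); a label is the sorted string of
    diacritics attached to the corresponding base character."""
    chars, labels = [], []
    for ch in word:
        if ch in DIACRITICS:
            if labels:  # skip a stray diacritic with no base character
                labels[-1].append(ch)
        else:
            chars.append(ch)
            labels.append([])
    return chars, ["".join(sorted(label)) for label in labels]


def der_wer(reference, prediction):
    """Compute (DER, WER) as fractions in [0, 1] for one text pair.

    Assumption: both texts have the same words and base characters and
    differ only in their diacritics.
    """
    ref_words, pred_words = reference.split(), prediction.split()
    if not ref_words or len(ref_words) != len(pred_words):
        raise ValueError("texts must be non-empty with equal word counts")
    char_total = char_errors = word_errors = 0
    for ref_word, pred_word in zip(ref_words, pred_words):
        ref_chars, ref_labels = split_labels(ref_word)
        pred_chars, pred_labels = split_labels(pred_word)
        if ref_chars != pred_chars:
            raise ValueError("base characters differ between texts")
        errors = sum(r != p for r, p in zip(ref_labels, pred_labels))
        char_total += len(ref_chars)
        char_errors += errors
        word_errors += 1 if errors else 0
    return char_errors / char_total, word_errors / len(ref_words)
```

Because a single wrong diacritic makes the whole word count as an error, WER is always at least as large as DER under these definitions, which is consistent with the figures reported above.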

To tackle these challenges, this paper introduces PTCAD (Pre-FineTuned Token Classification for Arabic Diacritization), a novel two-phase approach to ATD. The core idea of PTCAD is to frame the ATD task as a finetuning task for pre-trained BERT-like models, leveraging their robustness in encapsulating contextual information. Initially, PTCAD undergoes simultaneous pre-finetuning on linguistically relevant tasks, such as finetuning on CA texts, POS tagging, segmentation, and text diacritization, all framed as masked language modeling (MLM) tasks. This enriches the model’s contextual understanding by integrating knowledge from these tasks. Following this, PTCAD progresses into a finetuning phase, where ATD is treated as a token classification task. This phase capitalizes on the contextual groundwork laid in the pre-finetuning phase, refining the model’s ability to accurately diacritize Arabic text. Our evaluation of PTCAD on two widely recognized benchmark datasets, the cleaned version of Tashkeela provided by Abbad and Xiong (2020) and the version by Fadel et al. (2019a), showcases its effectiveness. PTCAD achieves a 20% reduction in WER compared to the state-of-the-art models on both benchmarks. Furthermore, through an ablation study, we emphasize the pivotal role of the pre-finetuning phase in shaping the overall performance of PTCAD.
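As an illustration of what treating ATD as token classification means, the sketch below converts a diacritized string into an undiacritized input sequence and one label per input unit. It assumes character-level units and uses the raw diacritic string as the label; the tokenization and label inventory actually used by PTCAD are not specified in this section, so both choices, along with the function names, are assumptions made for illustration only.

```python
# Arabic diacritic code points U+064B..U+0652:
# tanween (3), short vowels (3), shadda, sukun.
DIACRITICS = {chr(cp) for cp in range(0x064B, 0x0653)}


def to_token_classification(diacritized):
    """Split diacritized text into base characters and per-character labels.

    The label of a character is the (possibly empty) string of diacritics
    that immediately follow it, for example shadda followed by fatha.
    """
    chars, labels = [], []
    for ch in diacritized:
        if ch in DIACRITICS:
            if chars:  # ignore a stray diacritic with no base character
                labels[-1] += ch
        else:
            chars.append(ch)
            labels.append("")
    return chars, labels


def apply_labels(chars, labels):
    """Inverse operation: re-attach predicted labels to the base characters."""
    return "".join(ch + label for ch, label in zip(chars, labels))


# The verb "kataba": three consonants, each followed by a fatha (U+064E).
word = "\u0643\u064e\u062a\u064e\u0628\u064e"
chars, labels = to_token_classification(word)
undiacritized = "".join(chars)  # model input
restored = apply_labels(chars, labels)  # equals `word`
```

Under this framing, a classifier only has to choose one label per input position, so the output is guaranteed to preserve the original letters, unlike free-form sequence generation.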

Moreover, our comprehensive error analysis offers crucial insights into the types and sources of errors, guiding future refinements and developments in Arabic diacritization methods. Alongside these primary evaluations, we conducted an assessment of GPT-4’s capability in the ATD task. In contrast to PTCAD, GPT-4 exhibited lower performance levels, with a DER of 20% and a WER of 30% on the benchmark dataset provided by Fadel et al. (2019a). This comparison underscores the specialized efficiency of PTCAD in handling the complexities of Arabic diacritization compared to general-purpose models like GPT-4. The main contributions of our paper can be summarized as follows:

• Introduction of PTCAD, a two-phase training methodology for the ATD task. The first phase, pre-finetuning, integrates learning from linguistically relevant tasks to enhance contextual understanding. The second phase involves finetuning ATD as a token classification task for pre-trained BERT-like models.

• Effectiveness of PTCAD demonstrated through benchmarking on two standard datasets: the cleaned versions of Tashkeela by Abbad and Xiong (2020) and Fadel et al. (2019a). Notably, PTCAD achieves a 20% reduction in WER compared to the state-of-the-art (SOTA) on both datasets. Additionally, in our assessments, GPT-4 showed comparatively lower performance in the ATD task.

• An ablation study revealing the efficacy of the pre-finetuning phase, highlighting the importance of multi-task learning in Phase 1 for overall performance enhancement. The study is complemented by an error analysis to identify and address sources of errors, further substantiating our training strategy’s effectiveness.

The rest of the paper is organized as follows. Section 2 provides a comprehensive review of relevant literature on Arabic diacritization. Section 3 introduces our proposed modeling approach. Section 4 presents the datasets used and the evaluation metrics employed in our study. In Section 5, we present the experimental results, including an ablation study that assesses the significance of multi-task training. Section 6 presents the advantages and limitations of our approach. Finally, the last section summarizes the study’s key findings and discusses future research directions.

2 Related Work

The development of automatic diacritization of Arabic text has been explored in various studies. This survey categorizes related works into two groups based on the type of Arabic language data used, namely CA and MSA. Within each category, the methods are arranged chronologically and benchmarked on similar datasets to present a clear progression of the evolving research in this domain.

Focusing on CA (Table 1). Abbad and Xiong (2020) utilized their cleaned version of the Tashkeela dataset, achieving a DER of 3.39% and a WER of 9.94%. This result was reached using a deep learning model composed of a multi-layer RNN with LSTM and dense layers, combined with rule-based and statistical correction components. Madhfar and Qamar (2020) reported enhanced results on the same dataset, obtaining a WER of 4.43% and a DER of 1.13%. They employed a character-level Convolutional Bank and Highway network architecture followed by a Bidirectional GRU module. Subsequently, Abbad and Xiong (2021) used a subset of the data from Abbad and Xiong (2020) and implemented a model with character-level embeddings of size 128 and 4 Bi-LSTM hidden layers. They divided diacritics into four groups and applied a sliding window operation for data generation, resulting in a DER of 3-3.6% and a WER of 8.55-8.99%, considering both the inclusion and exclusion of diacritics. Abandah et al. (2022a) and Abandah et al. (2022b) used the Arabic Poem Comprehensive Dataset 2 (APCD2) (Yousef et al., 2019) for training. The former study employed the model developed by Abandah and Abdel-Karim (Gheith Abandah, 2020), achieving a WER of 20.40% and a DER of 6.08% on a specific test subset of APCD2. This subset consisted of text samples with a diacritics-to-letters ratio of 50% or higher. The latter study augmented diacritization models with a meter classification model. For the first test subset, with a diacritics-to-letters ratio of 50% or higher, they achieved a DER of 4.46% and a WER of 15.43%. On a separate test subset of APCD2 containing text samples with a diacritics-to-letters ratio of 67% or higher, they obtained a DER of 3.54% and a WER of 12.34%.
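The diacritics-to-letters ratio used to select the APCD2 subsets can be sketched as follows. This is a minimal illustration that assumes the ratio is the number of diacritic marks divided by the number of letters in a sample, with a shadda and its accompanying vowel counted as two marks. The exact counting rules of the cited studies are not given here, so the function names and these counting choices are assumptions.

```python
# Arabic diacritic code points U+064B..U+0652:
# tanween (3), short vowels (3), shadda, sukun.
DIACRITICS = {chr(cp) for cp in range(0x064B, 0x0653)}


def diacritics_to_letters_ratio(text):
    """Number of diacritic marks divided by the number of letters.

    Diacritics are combining marks, so str.isalpha() is False for them
    and True for Arabic letters.
    """
    n_diacritics = sum(ch in DIACRITICS for ch in text)
    n_letters = sum(ch.isalpha() for ch in text)
    return n_diacritics / n_letters if n_letters else 0.0


def filter_by_ratio(samples, threshold=0.5):
    """Keep samples whose ratio is at least `threshold` (0.5 or 0.67 above)."""
    return [s for s in samples if diacritics_to_letters_ratio(s) >= threshold]
```

A higher threshold keeps fewer but more completely diacritized samples, which matches the smaller size of the 67% subset relative to the 50% subset.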

Moving on to additional datasets, Fadel et al. (2019a) benchmarked on a cleaned version of Tashkeela with 55k sentences and 2.3M words. They achieved a DER of 4.36% and a WER of 10.89% using the Shakkala model (Barqawi and Zerrouki, 2017), which is a character-level Bi-LSTM model. Fadel et al. (2019b) also benchmarked on the Fadel et al. (2019a) dataset, achieving a DER of 3% and a WER of 7.39%. They employed a character-level RNN with a Bidirectional Neural Grammar (BNG) module. The architecture included 2 BiCuDNNLSTM layers with 512 hidden units each, and the model was trained for 10 epochs. AlKhamissi et al. (2020) benchmarked on the same data and achieved a DER of 2.09% and a WER of 5.08% with a two-level architecture operating on words and characters. The model first captures the context of the whole sentence from the sequence of words. Then, for each word, it processes the word's characters and attends to the relationship between each character and the context of all the words in the sentence. Using this same data, Al-Sabri and Gao (2021) achieved a DER of 2.71% and a WER of 6.9% with an architecture incorporating a novel linguistic feature representation, a Bi-LSTM layer to learn character-level linguistic features, and an attention mechanism to extract the most effective linguistic features. Their work also benchmarked on the Holy Quran (6k sentences) and Sahih Al-Bukhary (9k sentences), achieving a DER of 2.7% and a WER of 6.56% for the Holy Quran and a DER of 2.52% and a WER of 5.20% for Sahih Al-Bukhary.

Focusing on MSA (Table 2). Zalmout and Habash (2020) adopted a joint modeling approach employing a sequence-to-sequence architecture with distinct parameter-sharing strategies for MSA and Egyptian dialects. Their work was benchmarked on PATB (parts 1, 2, and 3) (Maamouri et al., 2004) for MSA and the ARZ dataset (parts 1-5) (Maamouri et al., 2012) for the Egyptian dialect. The study achieved an accuracy of 93.9% in diacritized forms. Alqahtani et al. (2020) proposed a multi-task learning model to jointly optimize diacritic restoration along with related NLP tasks like word segmentation, POS tagging, and syntactic diacritization. Their model was benchmarked on PATB (parts 1, 2, and 3) following the same data division as Diab et al. (2013). It achieved a WER of 7.51% and a DER of 2.54%. Hifny (2021) introduced a combination of LSTM networks and Maximum Entropy methods, employing knowledge distillation techniques. The study was benchmarked on PATB part 3 (Maamouri et al., 2004), with training conducted on 600 stories and testing on 91 articles from Al Nahar News text (European Language Resources Association, 2001). The model achieved a WER of 4.3%. Subsequently, Qin et al. (2021) explored the use of an adversarial training strategy in the conventional sequence-to-sequence model. A DER of 2.15% and a WER of 6.35% were achieved on the PATB dataset. Thompson and Alshehri (2022) proposed a multitask learning approach employing a character-level transformer encoder-decoder model for diacritization and parallel text translation, benchmarked on PATB parts 1, 2, and 3 (Maamouri et al., 2004), resulting in a WER of 4.79%. Mubarak et al. (2019b) and Mubarak et al. (2019a) utilized a compilation of 4.5 million words from Darwish et al. (2017) as an MSA train set and a corpus extracted from the WikiNews dataset, consisting of 18.3 thousand words, as an MSA test set. Mubarak et al. (2019b) employed a character-level seq2seq model on a sliding window of words, achieving a WER of 4.49% and a DER of 1.21%. Mubarak et al. (2019a) used a similar approach but included a voting component to select the most common diacritized form for each word, yielding a WER of 4.5%. Finally, Darwish et al. (2021) used a feature-rich sequence-to-sequence model to restore diacritics in Arabic text. The model achieved a DER of 0.9% and a WER of 2.9% by training on the corpus used to train the RDI diacritizer (Rashwan et al., 2015) and the Farasa diacritizer (Darwish et al., 2017), and evaluating on the WikiNews dataset.
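The sliding-window-with-voting idea described for the Mubarak et al. systems can be sketched as below. This is an illustrative outline, not the authors' implementation: `diacritize_window` is a hypothetical callable standing in for a trained model, and the window size and the tie-breaking rule (first form seen wins) are assumptions.

```python
from collections import Counter


def sliding_windows(words, size):
    """Yield (start_index, window) pairs with a stride of one word."""
    if len(words) <= size:
        yield 0, list(words)
        return
    for start in range(len(words) - size + 1):
        yield start, words[start:start + size]


def diacritize_with_voting(words, diacritize_window, size=3):
    """Diacritize each window, then keep the most common form per word.

    `diacritize_window` is a hypothetical stand-in for a model: it takes a
    list of undiacritized words and returns one diacritized form per word.
    """
    votes = [Counter() for _ in words]
    for start, window in sliding_windows(words, size):
        outputs = diacritize_window(window)
        if len(outputs) != len(window):
            raise ValueError("model must return one form per input word")
        for offset, form in enumerate(outputs):
            votes[start + offset][form] += 1
    # Counter.most_common breaks ties by first insertion order.
    return [counter.most_common(1)[0][0] for counter in votes]
```

Each word in the interior of a sentence is seen in up to `size` different windows, so an error made in one context can be outvoted by the forms predicted in the others.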

Summaries of all the works surveyed above for CA and MSA can be found in Table 1 and Table 2, respectively.

Table 1: Benchmarking Tashkeel literature on CA

| Benchmark Data | Article | Details on data | Experimental details and results | Approach |
|---|---|---|---|---|
| Abbad and Xiong (2020) cleaned version of Tashkeela | Abbad and Xiong (2020) | Train: 28M words; Test: 1.7M words | DER of 3.39%; WER of 9.94%; including partially diacritized sentences | A model composed of a multi-layer RNN with LSTM and Dense layers, combined with rule-based and statistical correction components. |
| Abbad and Xiong (2020) cleaned version of Tashkeela | Madhfar and Qamar (2020) | Additional post-processings; Train: 2.3M sentences; Test: 124K sentences | DER of 1.13%; WER of 4.43%; excluding partially diacritized sentences | Character-level Convolutional Bank and Highway Network architecture followed by a Bidirectional GRU module. |
| Abbad and Xiong (2020) cleaned version of Tashkeela | Abbad and Xiong (2021) | A subset of the Abbad and Xiong (2020) dataset; Train: the first 10 files from the train set; Test: the first file from both validation and test set | DER of 3.6%; WER of 8.55%; excluding partially diacritized sentences | Model employing 128-sized character-level embeddings and four Bi-LSTM hidden layers; categorizes diacritics into four groups, utilizing a sliding window for data generation. |
| Arabic Poem Comprehensive Dataset 2 (APCD2) (Yousef et al., 2019) | Abandah et al. (2022b) | Cleaning APCD2: removing partially diacritized sentences up to a certain threshold; DS1: (Train: 313K verses, Test: 55K verses); DS2: (Train: 76K verses, Test: 13K verses) | DS1 test: DER of 4.46%, WER of 15.43%; DS2 test: DER of 3.54%, WER of 12.34% | Enhanced diacritization models combined with a meter classification model, trained first on DS1 (diacritics-to-letters ratio >= 50%) then on DS2 (diacritics-to-letters ratio >= 67%). |
| Arabic Poem Comprehensive Dataset 2 (APCD2) (Yousef et al., 2019) | Abandah et al. (2022a) | Selecting from APCD2 all the verses in the training set that have a diacritics-to-letters ratio of 0.50 or higher; 368K diacritized verses consisting of 3.5M words, which were then split into 85% training set and 15% validation set. | DER of 6.08%; WER of 20.40% | The model of Gheith Abandah (2020) |
| Fadel et al. (2019a) data | Fadel et al. (2019a) | Train: 2M words and 50K sentences; Test: 107K words and 2.5K sentences | DER of 4.36%; WER of 10.89%; excluding partially diacritized sentences | Shakkala model employs Bi-LSTM networks and character embeddings, iteratively trained on the Tashkeela corpus, discarding detrimental data. |
| Fadel et al. (2019a) data | Fadel et al. (2019b) | | DER of 3%; WER of 7.39%; excluding partially diacritized sentences | Character-level RNN with BNG module, consisting of 2 BiCuDNNLSTM layers with 512 hidden units each. |
| Fadel et al. (2019a) data | AlKhamissi et al. (2020) | | DER of 2.09%; WER of 5.08%; excluding partially diacritized sentences | Two-level architecture analyzes sentence context via word sequences, then dissects each word into characters, examining the relationship between each character and the sentence’s word context. |
| Fadel et al. (2019a) data | Al-Sabri and Gao (2021) | | DER of 2.71%; WER of 6.9%; excluding partially diacritized sentences | Architecture integrates novel linguistic feature representation, a Bi-LSTM layer for character-level linguistic feature learning, and an attention mechanism to extract prominent linguistic features. |
| The Holy Quran and Sahih Al-Bukhary | Al-Sabri and Gao (2021) | The Holy Quran: 6K sentences; Sahih Al-Bukhary: 9K sentences | The Holy Quran: DER of 2.7%, WER of 6.56%; Sahih Al-Bukhary: DER of 2.52%, WER of 5.20%; excluding partially diacritized sentences | Idem. |

Table 2: Benchmarking Tashkeel literature on MSA

| Benchmark Data | Article | Details on data | Experimental details and results | Approach |
|---|---|---|---|---|
| PATB (Maamouri et al., 2004) | Zalmout and Habash (2020) | PATB parts 1, 2, and 3 (Maamouri et al., 2004), following the same data division as Diab et al. (2013); Train: 502K words; Test: 64K words | Diacritized forms accuracy of 93.9% (the accuracy of the diacritized form of the words) | Sequence-to-sequence architecture with diverse parameter sharing strategies. Lexicalized features (lemmas, diacritized forms) are modeled at the character level, while non-lexicalized features (gender, number) are modeled at the word level. The model employs multitask learning for non-lexicalized features and separate decoders for lexicalized features, aiming to enhance context modeling and disambiguation of ambiguous lexical choices. |
| PATB (Maamouri et al., 2004) | Alqahtani et al. (2020) | | DER of 2.54%; WER of 7.51% | Multi-task learning model to jointly optimize diacritic restoration along with related NLP tasks like word segmentation, POS tagging, and syntactic diacritization. |
| PATB (Maamouri et al., 2004) | Hifny (2021) | PATB Part 3, version 2; Train: 600 stories (340,281 words) from the Al Nahar News text (European Language Resources Association, 2001); Test: 91 articles (about 52,000 words) from October to December 2002 | WER of 4.3% | Combined LSTM networks and Maximum Entropy methods with knowledge distillation techniques. |
| PATB (Maamouri et al., 2004) | Qin et al. (2021) | PATB parts 1, 2, and 3, following the same data division as Diab et al. (2013); Train: 502K words; Test: 64K words | DER of 1.77%; WER of 4.88% | Combination of regularized decoding and adversarial training. In addition to the gold diacritized sentences, the model synthesizes new sentences and trains on both gold and synthetic ones, while the discriminator tries to predict whether a sentence is a gold one or not. |
| PATB (Maamouri et al., 2004) | Thompson and Alshehri (2022) | | WER of 4.79% | Multitask learning approach employing a character-level transformer encoder-decoder model for diacritization and parallel text translation. |
| MSA train: Darwish et al. (2017); MSA test: WikiNews corpus (Darwish et al., 2017) | Mubarak et al. (2019b) | Train: 4.5M words; Test: 18K words | DER of 1.21%; WER of 4.49%; including partially diacritized sentences | Character-level seq2seq model on a sliding window of words represented using characters, with voting to pick the most likely diacritized form from different windows. |
| MSA train: Darwish et al. (2017); MSA test: WikiNews corpus (Darwish et al., 2017) | Mubarak et al. (2019a) | | WER of 4.5%; including partially diacritized sentences | Character-level sequence-to-sequence model. After diacritization, the system includes a voting component to select the most common diacritized form for each word, considering multiple diacritized versions obtained from consecutive windows of the text. |
| MSA train: the corpus used to train the RDI diacritizer (Rashwan et al., 2015) and the Farasa diacritizer (Darwish et al., 2017); MSA test: WikiNews corpus (Darwish et al., 2017) | Darwish et al. (2021) | 9.7M tokens with approximately 194K unique surface forms (excluding numbers and punctuation marks) | DER of 0.9%; WER of 2.9%; including partially diacritized sentences | Two separate Deep Neural Network architectures to recover both kinds of diacritics: core-word (CW) diacritics and case endings (CEs). For CW diacritics, they used a character-level biLSTM model with associated features, informed by word segmentation information, and a unigram language model as post-corrector. For CE recovery, they employed a word-level biLSTM model trained with a rich set of surface, morphological, and syntactic features. |
