Advancing the Arabic WordNet: Elevating Content Quality

Abstract

High-quality WordNets are crucial for achieving high-quality results in NLP applications that rely on such resources. However, the wordnets of most languages suffer from serious issues of correctness and completeness with respect to the words and word meanings they define, such as incorrect lemmas, missing glosses and example sentences, or an inadequate, Western-centric representation of the morphology and the semantics of the language. Previous efforts have largely focused on increasing lexical coverage while ignoring other qualitative aspects. In this paper, we focus on the Arabic language and introduce a major revision of the Arabic WordNet that addresses multiple dimensions of lexico-semantic resource quality. As a result, we updated more than 58% of the synsets of the existing Arabic WordNet by adding missing information and correcting errors. In order to address issues of language diversity and untranslatability, we also extended the wordnet structure by new elements: phrasets and lexical gaps.

Keywords: Arabic, wordnet, quality, completeness, correctness, phraset, lexical semantics


Abed Alhakim Freihat 1, Hadi Khalilia 1,2,∗, Gábor Bella 3, Fausto Giunchiglia 1
1Department of Information Engineering and Computer Science, University of Trento, Italy
2Palestine Technical University – Kadoorie, Palestine
3Lab-STICC CNRS UMR 628, IMT Atlantique, Brest, France
1 {abed.freihat, hadi.khalilia, fausto.giunchiglia}@unitn.it,
2[email protected]
3[email protected]

1.   Introduction

WordNets Beckwith et al. (2021) are lexical databases that represent lemmas (lexemes, words) of a language, together with their meanings organised into a lexico-semantic network. Wordnets define meanings as sets of synonymous words called synsets. Synsets are described by a gloss (e.g., a definition in a natural language that represents the synset meaning) as well as example sentences that clarify the usage of words in context. WordNets are used in many NLP applications, such as machine translation Poibeau (2017), information retrieval Nie (2022), or word sense disambiguation Navigli (2009).
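To make the structure concrete, the following is a minimal sketch of a synset record as described above (lemmas, gloss, examples). The class name, field names, and identifier strings are illustrative assumptions and do not reflect the storage format of any particular wordnet.

```python
from dataclasses import dataclass, field


@dataclass
class Synset:
    """One meaning: a set of synonymous lemmas plus its documentation."""
    synset_id: str                      # hypothetical identifier scheme
    pos: str                            # part of speech, e.g. "n" or "v"
    lemmas: list[str]
    gloss: str = ""
    examples: list[str] = field(default_factory=list)

    def is_documented(self) -> bool:
        """True when the synset has both a gloss and at least one example."""
        return bool(self.gloss.strip()) and len(self.examples) > 0


noise = Synset(
    synset_id="example-noise-n",        # made-up ID, not a real PWN offset
    pos="n",
    lemmas=["noise"],
    gloss="sound of any kind, especially unintelligible or dissonant sound",
    examples=["he enjoyed the street noises"],
)
# A synset with lemmas only, as is common in automatically built wordnets.
bare = Synset(synset_id="example-bare-n", pos="n", lemmas=["noise"])
```

A synset like `bare` carries lemmas but gives a reader no way to tell which meaning is intended.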

The English Princeton WordNet (PWN) (Miller, 1995), as the first wordnet, has been adapted and employed as a foundation for constructing wordnets in other languages.

In general, WordNets are constructed using either the merge or the expand model Vossen (1998). In the merge model, synsets are initially created from pre-existing resources (e.g., dictionaries) in a language. Then, for translatability into other languages, the synsets have to be aligned with equivalent English synsets in PWN. For example, the IndoWordNet (Bhattacharyya, 2010) was built following this model. In the expand model, PWN synsets are ‘localized’ or ‘translated’ into target languages. For example, the Polish WordNet (Piasecki et al., 2009) was constructed using this model. In either case, the PWN synsets (and thus the English language) are usually used as a pivot when mapping and translating words across languages.
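The pivot role of PWN can be illustrated with a small sketch: two wordnets that are each aligned to the same pivot synset identifiers can be used to translate between each other. The identifiers, the toy lemma tables, and the function name below are assumptions made for illustration only.

```python
# Toy lemma tables keyed by a shared pivot synset ID. The IDs are made up.
polish = {"pivot:0001": ["pies"], "pivot:0002": ["kot"]}
italian = {"pivot:0001": ["cane"], "pivot:0003": ["marrone"]}


def translate_via_pivot(word, source, target):
    """Return the target lemmas of every pivot synset that contains `word`."""
    results = []
    for pivot_id, lemmas in source.items():
        if word in lemmas:
            # An absent pivot ID in the target yields no translation at all.
            results.extend(target.get(pivot_id, []))
    return results


dog_in_italian = translate_via_pivot("pies", polish, italian)
cat_in_italian = translate_via_pivot("kot", polish, italian)
```

In this toy setup the second lookup returns an empty list. The data alone cannot tell whether the word is merely missing from the target resource or is genuinely untranslatable.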

Wordnets often suffer from quality issues, in large part due to the use of automated and semi-automated methods for building them Khalilia et al. (2021a, b). In addition, mistakes can be hard to detect as most wordnets do not contain glosses or example sentences. Both problems affect the existing Arabic wordnets. The first Arabic wordnet (AWN V1) was built following the expand model (Elkateb et al., 2006) and includes 9,618 synsets translated from PWN to Modern Standard Arabic. Its second version (AWN V2) (Regragui et al., 2016) extended AWN V1 to 11,269 synsets and was developed using a semi-automatic method and the expand model. As we show in this paper, both wordnets suffer from correctness and completeness issues, and lack glosses and examples. By correctness we refer to the accuracy of lemmas in representing the meaning of a synset, while completeness refers to the extent to which a synset includes all words that are synonymous based on the synset meaning. For example, without an Arabic gloss and example sentences, it is hard to judge the correctness and completeness of the AWN V1 synset {, , , } that corresponds to the English WordNet synset {actuation, propulsion: the act of propelling; actuation of this app needs a password}.

In this paper, we introduce AWN V3, a significantly extended and quality-enhanced version of AWN V1. The novel contents of this new Arabic wordnet are: (a) the addition of glosses and examples to all synsets; (b) the improvement of the correctness and the completeness of the wordnet by adding missing lemmas and removing erroneous ones; (c) a reduced level of polysemy with respect to other wordnets through the elimination of redundant word meanings, based on our prior research; and (d) addressing phenomena of language diversity by introducing new linguistic information, namely lexical gaps that explicitly indicate untranslatability Giunchiglia et al. (2018); Bella et al. (2022) and phrasets, i.e., free combinations of words that express the meaning of a synset in case of nonexistent equivalent lemmas Bentivogli and Pianta (2000). Such explicit representations of untranslatability distinguish these cases from resource incompleteness (i.e., words merely missing from the resource) and give indications to both human and machine translators about particularly difficult cases of translation. Also, we tackle the polysemy problem of the source synsets by not inheriting specialization polysemy Freihat et al. (2013) and compound noun polysemy Freihat et al. (2015) in the target synsets.

Accordingly, the paper presents the following contributions: (1) the extension of the existing Arabic wordnet model by devices for tackling untranslatability: lexical gaps and phrasets; (2) a development methodology for lexical databases, inscribed within the expand model, that ensures a high-quality and diversity-aware output; (3)  AWN V3, the new and freely available Arabic wordnet resource as described above.

The rest of the paper is organized as follows. In Section 2, we introduce the state of the art of Arabic wordnets. Sections 3 and 4 present our contributions in addressing language diversity and excessive polysemy, respectively. In Section 5, we describe our synset localization method. Section 6 presents AWN V3, the high-quality Arabic lexical resource resulting from our work. Finally, we provide conclusions and discuss future work in Section 7.

2.   State of the Art

The first effort of building an Arabic wordnet was undertaken by Diab (2004). She introduced an automated approach known as SALAAM (Sense Assignment Leveraging Annotations And Multilinguality) to translate synsets from PWN into standard Arabic. This translation process relied on PWN 1.7 and an English-Arabic corpus as knowledge sources. Notably, her primary focus was on translating lemmas without glosses and example sentences. This approach was evaluated using a dataset comprising 447 synsets.

AWN V1 represents the inaugural Arabic WordNet developed by Elkateb et al. (2006). The development approach closely mirrors the methodology employed in creating EuroWordnet (Vossen, 1999), which consists of two phases. The first phase involves constructing a foundational core wordnet centered around base concepts Vossen (1998), while the second phase focuses on expanding the core wordnet’s coverage by incorporating additional criteria. This version of AWN is aligned with PWN in terms of structure and content covering WordNet domains defined by Magnini and Cavaglia (2000). This wordnet also integrates the Suggested Upper Merged Ontology (SUMO) to provide a formal semantic framework Elkateb et al. (2006).

In the case of the core Arabic WordNet, the process involves encoding the Common Base Concepts (CBCs) found in the EuroWordnet and BalkaNet (Tufis et al., 2004) as synsets. This is achieved through a manual translation effort, wherein all English synsets having an equivalence relation in the SUMO ontology are translated into their corresponding Arabic synsets. Figure 1 illustrates this process, showing an example of how Arabic Synsets are linked to overarching SUMO terms that directly correspond to the associated English synsets. Each translated synset is validated by evaluating the coverage of synset lemmas and the domain distribution of these synsets. These efforts produced 9,228 synsets in the core wordnet of AWN V1. The distribution of these synsets, categorized by Part-Of-Speech (POS), is detailed in Table  1.

Figure 1: SUMO mapping to WordNets Elkateb et al. (2006)

| POS       | AWN V1 (Core WN) | AWN V1 (Ext. WN) | AWN V2 |
|-----------|------------------|------------------|--------|
| Noun      | 6,252            | 6,558            | 7,960  |
| Verb      | 2,260            | 2,507            | 2,538  |
| Adjective | 606              | 446              | 271    |
| Adverb    | 106              | 107              | 500    |
| Total     | 9,228            | 9,618            | 11,269 |

Table 1: The count of Arabic synsets in each AWN version based on POS

To expand the core of AWN, Elkateb et al. (2006) introduced the Suggested Translation semi-automatic method, using available bilingual (Arabic-English) resources to extract <English word, Arabic word, POS> tuples. This method served a similar purpose in the development of Spanish WordNet (Farreres et al., 2002) and BalkaNet (Tufis et al., 2004). Building on eight heuristic procedures, associations between Arabic words and PWN synsets were assigned scores, relying on Arabic-English bilingual resources. Lexicographers utilized these scores to create new synsets or supplement existing ones with additional lemmas. The total number of synsets in this version is 9,618.

After the first release of AWN, there were many attempts to enrich its content in terms of the number of synsets, the number of lemmas, and the relations between them. Alkhalifa and Rodríguez (2009) introduced an automated method to enhance the coverage of named entities (NE) within AWN V1. This method used Wikipedia and established connections to PWN 2.0. In this study, 1,147 synsets were generated, covering 1,659 named entities across 31 general categories. Boudabous et al. (2013) and Batita and Zrigui (2018) each proposed a hybrid linguistic approach grounded in morphological patterns. They used Wikipedia and PWN to enrich AWN with new semantic relations. The former augmented AWN by establishing relations between nominal synsets, while the latter incorporated antonym relations.

As part of the ongoing efforts to enrich AWN, Abouenour et al. (2013) introduced a semi-automatic method to increase the coverage of AWN V1. Their objective was to enhance named entities (NEs), verbs, and noun synsets. For the enrichment of NE synsets, the authors present a three-step methodology, which translates YAGO (Yet Another Great Ontology) entities Suchanek et al. (2008) into Arabic instances and extracts Arabic synsets. Regarding verb synsets, the authors adopted a two-step approach inspired by Rodríguez et al. (2008). The first step involved suggesting new verbs by translating a set of verbs from VerbNet (Schuler, 2005) into standard Arabic. In the second step, Arabic verbs were interconnected with AWN synsets by establishing a graph connecting each Arabic verb with its corresponding English verbs in PWN. The authors employ a two-step method that detects hyponym/hypernym pairs from the web to enrich noun synsets. The overall result of this work is introducing a new version of AWN, known as AWN V2, including 11,269 synsets (for more details, see Table  1).

Despite the previous efforts, which primarily focused on expanding the coverage of synset lemmas, AWN still falls short compared to other WordNets in terms of content quality. This assessment was highlighted by Batita and Zrigui (2018), who emphasized in their research that “AWN has very poor content in both quantity and quality levels.” Our work focuses on the synset quality level, mainly on the synset correctness and completeness dimensions. AWN V1 marks a significant milestone for several reasons. Firstly, it encompasses the most common concepts and word senses found in PWN 2.0, ensuring a comprehensive representation in AWN V1. Secondly, its design and integration with PWN synsets facilitate cross-language usability. Finally, like other wordnets, AWN establishes a connection with SUMO, further enhancing its utility. Conversely, several issues related to synset quality have been identified in the majority of the synsets in this resource. These issues are also observed in AWN V2, as outlined below:

  1. All synsets lack gloss and/or illustrative examples.

  2. Many synsets contain incorrect senses, lemmas (including incorrect word forms or repeated words), and incorrect relations between synsets.

  3. Many synsets lack essential senses, lemmas, and necessary relations.

For instance, consider the following synset { , } presented in AWN V1, corresponding to the English synset {noise: sound of any kind, especially unintelligible or dissonant sound; he enjoyed the street noises}. In AWN V2, this synset was enriched to include {, }, resulting in {, , , }. In this case, the synset incorporates two erroneous lemmas {, }, which are not found in Arabic dictionaries such as the Almaany dictionary (http://www.almaany.com/thesaurus.php). Additionally, it lacks the lemma , which means noise.

In this paper, we enhance the accuracy of synset elements in AWN V1 by addressing incorrect lemmas and expanding the coverage of synsets through the addition of missing lemmas.

3.   Addressing Language Diversity

Cultural and linguistic differences abound across the more than seven thousand languages in the world, to which we simply refer as language diversity. To give a few examples from lexical semantics, the English word cousin, meaning the child of your aunt or uncle, does not have any equivalent term in Arabic. In contrast, the Arabic word , which means the brother of your father, does not exist in English Khalilia et al. (2023). Another example is from colors: the Italian word marrone, which means chestnut color, does not have an equivalent word in Persian and Welsh McCarthy et al. (2019), while the Breton glaz, spanning a range of hues between blue and green, has no equivalent in English or in the majority of Indo-European languages.

Linguists refer to such cases of lexical untranslatability as lexical gaps. A lexical gap happens when a word in one language is not lexicalized in another language Lehrer (1970). In such cases, speakers can express a similar meaning through a free combination of words called phrasets Bentivogli and Pianta (2000).

As in most wordnets, instances of language diversity are not explicitly indicated in the existing versions of AWN, which instead map Arabic synsets to PWN synsets in an approximate manner. Such inaccuracies degrade resource quality and introduce an Anglo-Saxon meaning bias, which in turn reduces the performance of applications relying on the resource, such as translation tools.

This paper introduces a new version of AWN that explicitly represents gaps and phrasets. For example, {adjectively: as an adjective; nouns are frequently used adjectively} is identified as a gap in the resulting resource; at the same time, this phraset is used to describe this synset. In addition, phrasets are also used for lexicalizations (translated synsets) in order to make the synset meaning clearer and easier to understand. For example, {unwittingly, unknowingly, inadvertently: without knowledge or intention; he unwittingly deleted the references } is translated { اً: , }, and the phraset is used.

Lexical gaps are implemented in our resource at the synset level, while phrasets are implemented on the word level.
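One possible way to picture this design is sketched below: the gap flag lives on the synset, while the phraset flag lives on each individual word entry. All class and field names are assumptions, and the Arabic strings are replaced by placeholders.

```python
from dataclasses import dataclass, field


@dataclass
class Word:
    form: str
    is_phraset: bool = False            # word-level marker


@dataclass
class TargetSynset:
    pwn_id: str                         # hypothetical link to the PWN synset
    words: list[Word] = field(default_factory=list)
    is_gap: bool = False                # synset-level marker

    def lemmas(self) -> list[str]:
        return [w.form for w in self.words if not w.is_phraset]

    def phrasets(self) -> list[str]:
        return [w.form for w in self.words if w.is_phraset]


# A gap: no lemma exists, but a phraset still describes the meaning.
adjectively = TargetSynset(
    pwn_id="example-adjectively-r",
    words=[Word("<Arabic phraset>", is_phraset=True)],
    is_gap=True,
)
# A lexicalized synset that carries a phraset next to its lemma for clarity.
unwittingly = TargetSynset(
    pwn_id="example-unwittingly-r",
    words=[Word("<Arabic lemma>"), Word("<Arabic phraset>", is_phraset=True)],
)
```

Keeping the two markers at different levels lets a lexicalized synset carry a phraset without being mistaken for a gap.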

4.   Addressing Polysemy

Polysemy is a well-known problem in PWN. It has been addressed in many studies, such as Gonzalo (2004); Mihalcea and Moldovan (2001); Buitelaar (1998); Freihat (2014). In our previous research Freihat et al. (2016), polysemy was classified into several types: homonymy, metaphor, metonymy, specialization polysemy, and compound noun polysemy. While the first three polysemy types are essential in lexical resources, the latter two are considered the main reasons behind the highly polysemous nature of WordNet, which makes WordNet too fine-grained for NLP. As an example of compound noun polysemy, the word head has more than 30 synsets (meanings) in PWN. Another example of compound noun polysemy is the word center, which has 18 synsets. The problem becomes clearer in the Arabic ontology Jarrar (2021), which has more than 500 synsets meaning center. For example, the word turtledove is polysemous because it belongs to the following two synsets: {australian turtledove, turtledove: small Australian dove.}, {turtledove: any of several Old World wild doves.} Of course, it is possible to use the word turtledove to refer to any kind of turtledove when the context makes clear which kind is meant. At the same time, adding the word turtledove as a synonym to all kinds of turtledoves in the lexical resource is useless and just makes the resource hard to use.

According to our research Freihat et al. (2015), the word sense disambiguation for these two types is similar to anaphora resolution and does not require including all these possible meanings in a lexical resource because they lead to the problem of sense enumeration which makes such resources very hard to use in NLP.
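As a rough illustration of the compound noun case, the sketch below drops a single-word lemma when the same synset also contains a multiword lemma built from it, as in {australian turtledove, turtledove}. This is a simple string heuristic written for illustration; it is an assumption and not the procedure used in the cited work.

```python
def drop_compound_components(lemmas):
    """Remove single-word lemmas that occur inside a multiword lemma
    of the same synset (illustrative heuristic only)."""
    multiword_tokens = set()
    for lemma in lemmas:
        tokens = lemma.split()
        if len(tokens) > 1:
            multiword_tokens.update(tokens)
    return [
        lemma for lemma in lemmas
        if len(lemma.split()) > 1 or lemma not in multiword_tokens
    ]


specific = drop_compound_components(["australian turtledove", "turtledove"])
general = drop_compound_components(["turtledove"])
```

Under this heuristic the general synset keeps its lemma, while the specific synset retains only the compound, so the word no longer appears in both.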

5.   Addressing Synset Quality

In the following, we list the goals of our approach:

  1. Synset glosses: Each synset should have a gloss that clearly identifies its meaning. Without such a gloss, we cannot understand the synset, nor can we differentiate between the meanings of the same lemma in different synsets. For example, the word 'love' has more than one meaning, i.e., it belongs to several synsets.

  2. Synset examples: Each lemma in a synset should have at least one example to clarify its usage. Such examples also allow us to verify the synonymy between the synset lemmas. This is crucial for synset correctness.

  3. Language diversity and phrasets: Ideas are expressed in different ways across cultures, which leads to untranslatability in some languages (i.e., lexical gaps). Another phenomenon in Arabic (and maybe in other languages) is the usage of prepositional phrasets to express a synset meaning. For example, the meaning of this synset { someday: some unspecified time in the future; someday you will understand my actions} is identified as a lexical gap in Arabic, and the phraset اً is used to express this meaning. We add these phrasets to the Arabic WordNet to increase the understandability of synsets. Also, such phrasets can be used in NLP applications to identify the intended synset.

  4. Errors in the source WordNet (PWN): PWN suffers from the polysemy problem. According to our previous work, the polysemy problem stems from specialization polysemy and sense enumeration. In our work, we avoid such polysemy types in the resulting resource to enhance AWN usability in NLP applications.

  5. Named entities: A lexical resource should include concepts only. It is not the correct place to include named entities, which may be another source of noise in lexical semantic resources.

Our approach consists of three steps:

  1. Task generation: We collected the data from AWN V1 and prepared the spreadsheet to be provided for translation.

  2. Task enhancement: The translators translated the corresponding PWN synset glosses and then performed the following: adding missing lemmas and examples for the synset lemmas, removing wrong lemmas from the original Arabic synsets, identifying gaps in cases of untranslatability, and adding phrasets to increase the understandability of synsets.

  3. Task validation: Validation is carried out in two phases. In the first phase, each contribution provided by one of the translators is validated by the other. In the second phase, a linguistic expert validates and approves the contribution.

5.1.   Task generation

This section describes the essential materials required for the next step of the methodology. The preparation process involves constructing a dataset containing AWN V1 synsets as well as the corresponding PWN synsets. In this context, AWN V1 and PWN browsers are utilized for data retrieval. This dataset is organized in a spreadsheet to make contributions simple to provide. The linguistic expert (the first author) organizes synsets into four categories (each in one sheet) based on the part of speech (POS). Each row within the spreadsheet represents a synset and includes information such as the synset ID, lemmas, gloss, and example sentences in Standard Arabic and English. Additionally, empty slots are provided for inserting missing lemmas, a gloss, examples, and comments by the data provider (translator) in Arabic. One additional slot is designated for validation purposes, along with comments from the validator. In this step, the linguistic expert excludes all named entities (42 synsets) from the spreadsheet.
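A sheet of this kind could be laid out as in the following sketch, which writes one row per synset with pre-filled source fields and empty slots for the translator and the validator. The column labels are assumptions; the text above specifies the kinds of fields, not their names.

```python
import csv
import io

# Hypothetical column labels for the fields described in the text.
SOURCE_COLUMNS = ["synset_id", "en_lemmas", "en_gloss", "en_examples",
                  "ar_lemmas", "ar_gloss", "ar_examples"]
EMPTY_SLOTS = ["missing_lemmas", "new_gloss", "new_examples",
               "translator_comment", "validation", "validator_comment"]
COLUMNS = SOURCE_COLUMNS + EMPTY_SLOTS


def make_row(**known):
    """Build one spreadsheet row; every unspecified slot starts empty."""
    row = dict.fromkeys(COLUMNS, "")
    row.update(known)
    return row


def write_sheet(rows):
    """Serialize rows for one POS sheet as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


sheet_text = write_sheet([
    make_row(synset_id="example-noise-n", en_lemmas="noise",
             en_gloss="sound of any kind"),
])
```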

5.2.   Task enhancement

Contributions for synset enhancement, which involve the addition of missing information or correction of synset elements, are made by two translators and validated by a language expert. An overview of our contribution collection workflow is illustrated in Figure 2. As depicted, the workflow is structured into two cycles, with the aim of ensuring the quality of results. The first cycle operates between the two translators, where each translator’s contributions are subject to verification by the other. The second cycle involves the validation of accepted contributions by a linguistic expert.

Figure 2: The workflow of the contribution collection

The process of synset enhancement in the first cycle was carried out by two native speakers. Regarding their socio-linguistic background, both translators possess at least a bachelor’s degree in the field of translation (English-Arabic). Before the translation, translators have been trained as described in the following subsections.

5.2.1.   Synset understanding

Central to this process is ensuring that the translator has a clear understanding of the synset they are tasked with translating. Misunderstandings can arise when the translator does not thoroughly understand both the synset lemmas and the gloss in English. The translators are asked to understand each PWN synset in the spreadsheet following these instructions:

  • Use external resources such as dictionaries and Wikipedia to understand the meaning of the synset.

  • Skip the synset or leave a comment when its meaning remains unclear.

5.2.2.   Lexical gap identification and synset lexicalisation

A lexical gap happens when either the meaning of the concept in a source language is not known in the culture of the target language or the concept can be lexicalized only through a free combination of words Giunchiglia et al. (2018). This means that there is no lexical unit (single word or restricted collocation) that corresponds to any of the source language lemmas. In this step, for each English synset in the spreadsheet, the translator decides whether it has an equivalent meaning in Arabic (a lexicalisation exists) or is a lexical gap, based on the understanding of the English synset and using a bilingual dictionary. If an English synset_i is a gap, the translator performs Step A; otherwise, she/he performs Steps B and C.

Step A: Lexical gap processing. In this step, the translator is asked to mark the English synset_i as a lexical gap in the spreadsheet and provide a phraset in Arabic. For example, the synset {expressively: with expression, in an expressive manner; she gave the order to the waiter, using her hands very expressively} is identified as a lexical gap in Arabic, and meaning (an expressive way) is provided as a phraset to this synset.

Step B: Synset translation. After the translator confirms that the meaning of the English synset_i exists in Arabic, she/he translates this synset into Arabic. This translation includes the following steps.

  1. Translating the synset gloss: Translation is cross-language and cross-cultural communication. A translation should give a complete rendering of the synset gloss, and its style and manner of writing should be of at least the same quality as the English gloss. Above all, faithfulness, expressiveness, and closeness are the three important elements of translation. The gloss should explicitly express the semantics and the common attributes of a synset.

  2. Translating synset lemmas: Translators should keep two key considerations in mind while translating synset lemmas. Firstly, this translation process does not entail a direct one-to-one correspondence between English and Arabic terms. Secondly, it is important to note that the set of lemmas within the English synset may not be exhaustive, meaning it might not contain all the synonyms associated with the synset. To translate the synset lemmas, we go through the following phases:

    (1) English lemma translation: Translate the English synset lemmas into Arabic. The result of this phase is a set of n lemmas, where n is the number of lemmas in the English synset.

    (2) Arabic synonym collection: For each translated lemma, the translators collect the lemma synonyms in Arabic. The result of this phase is m synonym sets in Arabic, with m ≤ n (since some Arabic lemmas may have empty synonym sets).

    (3) Arabic synonym validation: Based on the synset gloss, for each of the m synonym sets in Arabic, the translators exclude all synonyms that do not belong to the synset. They use the examples provided in the English synset, and other examples, to decide which synonyms to include or exclude in this phase.

    (4) Arabic lemma collection: The translators collect the Arabic lemmas resulting from the translation in phase (1) and the synonyms produced in phase (2) and retained in phase (3), and use them as the Arabic synset lemmas. In the case of polysemy, we resolve specialization polysemy and compound noun polysemy. For example, جِسْ is excluded from the synset { جِسْ فِزْئِ, جِسْ}, which corresponds to {object, physical object: a tangible and visible entity, an entity that can cast a shadow; pens, books and bags are school objects}.

    (5) Arabic lemma ordering: The translators order the Arabic lemmas collected in phase (4) in descending order of importance, so that the first lemma is the preferred term of the Arabic synset. The ordering is based on the examples provided in phase (3) (and other examples if needed).

  3. Translating synset examples: Examples within a synset contribute to a clearer comprehension of how to use the synset lemmas, consequently enhancing the overall understanding of the fully lexicalized synset. We employ the same examples crafted during the lemma translation phase as synset examples. This means that we do not merely translate the examples found in the English synset. Ideally, we provide an example sentence in Arabic for each synonym within the synset, even if the English synset does not contain examples at all. The provided examples are incorporated into the Arabic synsets in the same order as their respective synonyms.
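The five lemma translation phases of Step B can be summarized in the sketch below. The human decisions (dictionary lookup, synonym judgment, preference) are modeled as plain lookup tables and callables supplied by the caller; these parameters, the function name, and the placeholder lemma strings are all assumptions made for illustration.

```python
def build_target_lemmas(en_lemmas, translate, synonyms, fits_gloss, rank):
    """Sketch of phases (1) to (5); all inputs stand in for human judgments."""
    # Phase 1: translate each English lemma (at most n results).
    translated = [translate[l] for l in en_lemmas if l in translate]
    # Phase 2: collect synonym sets, skipping empty ones (m <= n sets).
    candidate_sets = [synonyms[t] for t in translated if synonyms.get(t)]
    # Phase 3: keep only the synonyms that match the synset gloss.
    kept_sets = [{s for s in cands if fits_gloss(s)} for cands in candidate_sets]
    # Phase 4: merge translations and validated synonyms.
    collected = set(translated).union(*kept_sets)
    # Phase 5: order by preference; unranked lemmas come last, alphabetically.
    return sorted(collected, key=lambda l: (rank.get(l, len(rank)), l))


ordered = build_target_lemmas(
    en_lemmas=["measure", "quantity", "amount"],
    translate={"measure": "A1", "quantity": "A2", "amount": "A3"},
    synonyms={"A1": {"A4", "X9"}, "A2": set()},
    fits_gloss=lambda lemma: lemma != "X9",   # pretend X9 fails the gloss check
    rank={"A2": 0, "A1": 1},                  # pretend A2 is the preferred term
)
```

The first element of `ordered` plays the role of the preferred term of the target synset.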

Step C: Comparing the translated synset produced in Step B with the corresponding synset from AWN V1. At this stage, a translator compares the translated synset generated in Step B with its corresponding Arabic synset, as imported from AWN V1. This Arabic synset corresponds to the English synset_i in the spreadsheet. Based on the gloss and examples provided in the generated synset, the translators undertake the following actions: (1) Copy lemmas from the translated synset to the AWN V1 synset if they are missing from the AWN V1 synset. (2) Exclude from the AWN V1 synset the lemmas that are not covered by the synset gloss and examples. (3) Copy the gloss and the examples from the translated synset to the AWN V1 synset if they are missing in the latter.
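The three actions of Step C amount to a small merge operation, sketched below. The dictionary layout, the function name, and the `is_covered` callable (standing in for the translator's judgment against the gloss and examples) are assumptions.

```python
def merge_step_c(translated, awn_v1, is_covered):
    """Merge a translated synset into its AWN V1 counterpart.

    Both synsets are dicts with 'lemmas' (list), 'gloss' (str), 'examples' (list).
    """
    # Action 2: drop AWN V1 lemmas not covered by the gloss and examples.
    kept = [l for l in awn_v1["lemmas"] if is_covered(l)]
    # Action 1: copy over translated lemmas that are still missing.
    merged = kept + [l for l in translated["lemmas"] if l not in kept]
    return {
        "lemmas": merged,
        # Action 3: fill in gloss and examples only when AWN V1 lacks them.
        "gloss": awn_v1["gloss"] or translated["gloss"],
        "examples": awn_v1["examples"] or translated["examples"],
        "added": [l for l in merged if l not in awn_v1["lemmas"]],
        "deleted": [l for l in awn_v1["lemmas"] if l not in merged],
    }


result = merge_step_c(
    translated={"lemmas": ["L1", "L2"], "gloss": "g", "examples": ["e"]},
    awn_v1={"lemmas": ["L2", "BAD"], "gloss": "", "examples": []},
    is_covered=lambda lemma: lemma != "BAD",
)
```

Recording the `added` and `deleted` lists separately mirrors the way additions and deletions are counted later in the paper.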

5.3.   Task validation

The validation process consists of two phases. In the first phase, the two translators validate the resulting synsets (stored in a spreadsheet containing English and produced Arabic synsets) in an alternating manner, checking each synset (and gap) one by one. During the validation, each of them considers the following:

  1. Gap validation: The validator checks each synset marked as a lexical gap in Arabic and either confirms it as a gap or rejects it as a non-gap due to an existing lexicalization in Arabic, in which case he/she should provide a gloss and lemmas for that synset.

  2. Gloss validation: The Arabic gloss should express the intended meaning of the English synset, be easy to understand, and contain no typos or grammatical errors.

  3. Lemmas validation: Synset lemmas should be correct (i.e., include no wrong lemmas) and complete (i.e., have no missing lemmas). In addition, the validator can use the examples to check synonymy between lemmas.

  4. Examples validation: Each lemma should have at least one example. The examples should be natural and express the intended usage.

In case of disagreement, the affected synsets are sent back to the translators with the validator’s comment. The accepted synsets are sent to the expert validation.

In the second phase, an Arabic linguistic expert validates the synsets (and gaps) that both translators accepted in the previous phase. The expert works on a spreadsheet containing only the resulting Arabic synsets and gaps, without the English synsets. His task is to approve the final resulting synsets. The same criteria used in the previous validation phase for validating gaps, glosses, lemmas, and examples are adopted in this step.
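Some of the validation criteria are mechanical and could be checked before a human validator looks at a synset, for instance that a gloss is present and that every lemma has at least one example. The sketch below covers only these mechanical parts; the record layout and the function name are assumptions, and the judgments about meaning, naturalness, and synonymy remain with the validators.

```python
def precheck(synset):
    """Return a list of mechanical problems found in one produced synset."""
    problems = []
    if synset["is_gap"]:
        # A gap carries no lemmas, but it should come with a phraset.
        if not synset["phrasets"]:
            problems.append("gap without phraset")
        return problems
    if not synset["gloss"].strip():
        problems.append("missing gloss")
    if not synset["lemmas"]:
        problems.append("no lemmas")
    for lemma in synset["lemmas"]:
        # 'examples' maps each lemma to its list of example sentences.
        if not synset["examples"].get(lemma):
            problems.append(f"no example for {lemma}")
    return problems


complete = {"is_gap": False, "gloss": "g", "lemmas": ["L1"],
            "examples": {"L1": ["sentence"]}, "phrasets": []}
incomplete = {"is_gap": False, "gloss": " ", "lemmas": ["L1", "L2"],
              "examples": {"L1": ["sentence"]}, "phrasets": []}
empty_gap = {"is_gap": True, "gloss": "", "lemmas": [],
             "examples": {}, "phrasets": []}
```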

6.   Evaluation and the Resulting Resource

This section demonstrates the use of the methodology described in Section 5 for evaluating and improving the content quality of AWN V1, using PWN as a reference. As mentioned above, AWN V1 includes 9,618 synsets written in Modern Standard Arabic (MSA), which refers to the standard form of the language used in academic writing, formal communication, classical poetry, and religious sermons.

In this study, contributions are provided by two translators (each is an Arabic native speaker). They were born and educated within the Arabic-speaking community, having completed at least their high school education within this community.

Four experiments (one for each POS) are performed to evaluate the extended version of AWN V1 synsets and tackle synset quality issues using our method. In each experiment, a spreadsheet includes Arabic synsets imported from the AWN V1 and their corresponding English synsets. Each spreadsheet contains data for a specific POS and serves as an input dataset to the contribution (synset quality enhancement) collection step. The experiments are conducted on 6,516 nouns, 2,507 verbs, 446 adjectives, and 107 adverbs (see Table 2 for more details).

|         | Noun   | Verb  | Adjective | Adverb | Total  |
|---------|--------|-------|-----------|--------|--------|
| Synsets | 6,516  | 2,507 | 446       | 107    | 9,576  |
| Words   | 13,659 | 5,878 | 761       | 262    | 20,560 |

Table 2: The count of synsets and words (imported from the extended AWN V1, without named entities) in the input dataset based on POS

In the contribution collection, for each Arabic synset in a row in the spreadsheet, a translator is tasked with translating the corresponding PWN synset into Arabic or identifying it as a lexical gap using a bilingual (English-Arabic) linguistic resource, such as the Al-Mawrid Al-Qareeb dictionary Baalbaki (2005). If a lexicalization exists in Arabic, the translator compares the generated Arabic translation with the AWN V1 synset in the same row, then adds missing synset lemmas, gloss, and example sentences and/or rectifies incorrect elements. If the English synset is a gap in Arabic, he/she marks it as a lexical gap and provides a phraset to express the synset (note that phrasets are also used for some translated synsets to increase understandability). To our knowledge, our resulting resource (AWN V3) is the first Arabic Wordnet that identifies gaps and provides phrasets.

The overall effort to collect contributions resulted in updating 5,554 synsets from AWN V1. We added 2,726 new lemmas, 9,322 new glosses, and 12,204 new example sentences. We also identified 236 lexical gaps and inserted 701 phrasets. Furthermore, we deleted 8,751 incorrect lemmas. More details regarding the counts of these contributions are presented in Table 3. The dataset is available on GitHub (https://github.com/HadiPTUK/AWN3.0). For each POS, two spreadsheets were uploaded: the first includes the final resulting Arabic synsets, and the second contains the added and deleted synset components.

Contribution     Noun   Verb   Adj  Adv  Total
Updated synsets  3,938  1,364  181  71   5,554
New lemmas       2,581  64     72   9    2,726
Deleted lemmas   6,050  2,387  223  91   8,751
New glosses      6,511  2,258  446  107  9,322
New examples     7,597  3,620  782  205  12,204
Gaps             28     187    0    21   236
Phrasets         364    275    0    62   701

Table 3: Statistics of the data added to and deleted from AWN
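As a worked illustration of how the counts in Tables 2 and 3 relate, the following sketch computes the share of input synsets that were updated, per POS and overall. The variable and function names are ours, introduced for illustration; the figures are copied from the two tables.

```python
# Synset counts per POS (Table 2) and updated-synset counts per POS (Table 3).
total_synsets = {"noun": 6516, "verb": 2507, "adjective": 446, "adverb": 107}
updated_synsets = {"noun": 3938, "verb": 1364, "adjective": 181, "adverb": 71}


def updated_share(updated: int, total: int) -> float:
    """Percentage of synsets that received at least one update."""
    return 100.0 * updated / total


shares = {
    pos: updated_share(updated_synsets[pos], total_synsets[pos])
    for pos in total_synsets
}
overall = updated_share(sum(updated_synsets.values()), sum(total_synsets.values()))

for pos, share in shares.items():
    print(f"{pos:<10}{share:6.1f}%")
print(f"{'total':<10}{overall:6.1f}%")
```

Under this reading, about 58% of the 9,576 input synsets were updated, with adverbs showing the highest share (about 66%) and adjectives the lowest (about 41%).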

Validation was carried out by an Arabic linguistic expert who holds a Ph.D. in the Arabic language and is a university instructor in a linguistics department. As introduced above, the expert followed the criteria described in Section 5.3 to verify the produced synsets. Results are shown in Table 4, where correctness is defined as the number of contributions validated as correct divided by the total number of contributions. These contributions can be newly added or deleted lemmas, collected glosses and example sentences, identified lexical gaps, and inserted phrasets. In the case of an added lemma, the validator either confirms the addition or rejects it by leaving a comment. For instance, {مِقْيَ}, meaning a measuring tool, was deemed an incorrect addition to the synset { مِقْدَ, قَدْ, كَمّ, كَمِّيَّ}, which corresponds to {measure, quantity, amount: how much there is of something that you can quantify; he has a big amount of money}. In the case of identified gaps, the validator either confirms them as gaps or marks them as non-gaps because a lexicalization exists in Arabic, which the validator needs to indicate. For instance, the English synset {try, try on: put on a garment in order to see whether it fits and looks nice; Try on this sweater to see how it looks} was marked as a gap by a translator; the validator rejected this and provided the word قَ, which has the same meaning.

Contribution Correctness
New lemmas 97.34%
Deleted lemmas 98.89%
New glosses 98.76%
New examples 99.13%
Gaps 96.82%
Phrasets 97.54%
Total 98.08%
Table 4: Validator evaluation of translator contributions
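The correctness figures can be related to the contribution counts with a short calculation. The sketch below is our own illustration, not part of the evaluation procedure: it compares the unweighted mean of the six category scores in Table 4 with a mean weighted by the contribution counts in Table 3. The per-category numbers of validated contributions are not reported, so the weighted figure is an approximation reconstructed from rounded percentages.

```python
# Per-category correctness in percent (Table 4).
correctness = {
    "new lemmas": 97.34,
    "deleted lemmas": 98.89,
    "new glosses": 98.76,
    "new examples": 99.13,
    "gaps": 96.82,
    "phrasets": 97.54,
}
# Number of contributions per category (Total column of Table 3).
counts = {
    "new lemmas": 2726,
    "deleted lemmas": 8751,
    "new glosses": 9322,
    "new examples": 12204,
    "gaps": 236,
    "phrasets": 701,
}


def correctness_rate(n_correct: int, n_total: int) -> float:
    """Correctness as defined in the text: validated-correct / total, in percent."""
    return 100.0 * n_correct / n_total


# Each category counts equally, regardless of its size.
unweighted_mean = sum(correctness.values()) / len(correctness)

# Each contribution counts equally, so large categories dominate.
total_count = sum(counts.values())
weighted_mean = sum(correctness[k] * counts[k] for k in counts) / total_count

print(f"unweighted mean: {unweighted_mean:.2f}%")
print(f"weighted mean:   {weighted_mean:.2f}%")
```

The unweighted mean reproduces the 98.08% reported in the Total row, which suggests that the total averages the six categories. A count-weighted aggregate would come out slightly higher, at roughly 98.8%, because glosses and example sentences make up most of the contributions.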

Upon discussion between the validator (linguistic expert) and the translators, the mistakes made by the latter can be explained by misunderstandings of the meanings of certain concepts provided in English. The validator made sure to exclude or fix the mistakes, bringing the correctness of the final dataset closer to 100%.

7.   Conclusion and Future Work

In this paper, we evaluated and addressed the quality (correctness and completeness) of synsets from AWN V1 across four parts of speech: nouns, verbs, adjectives, and adverbs. The resulting 9,576 synsets are introduced as AWN V3, an enhanced version of AWN with corrected and extended lemmas, as well as added glosses and example sentences. In order to represent English words that are not directly translatable into Arabic, we introduced phrasets, which provide approximate phrase-level translations, and lexical gaps, which indicate untranslatability. As part of our future work, we will apply the methodology described here to increase the coverage of Arabic synsets, based on AWN V2 as well as the remaining synsets in PWN.

8.   Bibliographical References


  • Abouenour et al. (2013) Lahsen Abouenour, Karim Bouzoubaa, and Paolo Rosso. 2013. On the evaluation and improvement of Arabic Wordnet coverage and usability. Language resources and evaluation, 47:891–917.
  • Alkhalifa and Rodríguez (2009) Musa Alkhalifa and Horacio Rodríguez. 2009. Automatically extending NE coverage of Arabic Wordnet using Wikipedia. In Proc. Of the 3rd International Conference on Arabic Language Processing CITALA2009, Rabat, Morocco, pages 23–30.
  • Baalbaki (2005) Rohi Baalbaki. 2005. Al-mawrid Al-qareeb Arabic-English Dictionary. Dar El Ilm Lilmalayin, Lebanon.
  • Batita and Zrigui (2018) Mohamed Ali Batita and Mounir Zrigui. 2018. The enrichment of Arabic Wordnet antonym relations. In Computational Linguistics and Intelligent Text Processing: 18th International Conference, CICLing 2017, Budapest, Hungary, April 17–23, 2017, Revised Selected Papers, Part I 18, pages 342–353. Springer.
  • Beckwith et al. (2021) Richard Beckwith, Christiane Fellbaum, Derek Gross, and George A Miller. 2021. Wordnet: A lexical database organized on psycholinguistic principles. In Lexical Acquisition, pages 211–232. Psychology Press.
  • Bella et al. (2022) Gábor Bella, Khuyagbaatar Batsuren, Temuulen Khishigsuren, and Fausto Giunchiglia. 2022. Linguistic diversity and bias in online dictionaries. University of Bayreuth African Studies Online, page 173.
  • Benítez et al. (1998) Laura Benítez, Sergi Cervell, Gerard Escudero, Mònica López, German Rigau, and Mariona Taulé. 1998. Methods and tools for building the Catalan Wordnet. arXiv preprint cmp-lg/9806009.
  • Bentivogli and Pianta (2000) Luisa Bentivogli and Emanuele Pianta. 2000. Looking for lexical gaps. In Proceedings of the ninth EURALEX International Congress, pages 8–12. Stuttgart: Universität Stuttgart.
  • Boudabous et al. (2013) Mohamed Mahdi Boudabous, Nouha Chaâben Kammoun, Nacef Khedher, Lamia Hadrich Belguith, and Fatiha Sadat. 2013. Arabic Wordnet semantic relations enrichment through morpho-lexical patterns. In 2013 1st International Conference on Communications, Signal Processing, and their Applications (ICCSPA), pages 1–6. IEEE.
  • Buitelaar (1998) Peter Paul Buitelaar. 1998. CoreLex: systematic polysemy and underspecification. Brandeis University.
  • Diab (2004) Mona Diab. 2004. The feasibility of bootstrapping an Arabic Wordnet leveraging parallel corpora and an English Wordnet. In Proceedings of the Arabic Language Technologies and Resources, NEMLAR, Cairo.
  • Freihat (2014) Abed Alhakim Freihat. 2014. An organizational approach to the polysemy problem in Wordnet. Ph.D. thesis, University of Trento.
  • Freihat et al. (2013) Abed Alhakim Freihat, Fausto Giunchiglia, and Biswanath Dutta. 2013. Solving specialization polysemy in wordnet. Int. J. Comput. Linguistics Appl., 4(1):29–52.
  • Freihat et al. (2016) Abed Alhakim Freihat, Fausto Giunchiglia, and Biswanath Dutta. 2016. A taxonomic classification of Wordnet polysemy types. In Proceedings of the 8th Global WordNet Conference (GWC), pages 106–114.
  • Freihat et al. (2015) Abed Alhakim Freihat, Biswanath Dutta, and Fausto Giunchiglia. 2015. Compound noun polysemy and sense enumeration in Wordnet. In Proceedings of the 7th International Conference on Information, Process, and Knowledge Management (eKNOW), pages 166–171.
  • Giunchiglia et al. (2018) Fausto Giunchiglia, Khuyagbaatar Batsuren, and Abed Alhakim Freihat. 2018. One world-seven thousand languages (best paper award, third place). In International Conference on Computational Linguistics and Intelligent Text Processing, pages 220–235. Springer.
  • Gonzalo (2004) Julio Gonzalo. 2004. Sense proximity versus sense relations. GWC 2004, page 5.
  • Jarrar (2021) Mustafa Jarrar. 2021. The Arabic ontology–an Arabic Wordnet with ontologically clean content. Applied ontology, 16(1):1–26.
  • Khalilia et al. (2023) Hadi Khalilia, Gábor Bella, Abed Alhakim Freihat, Shandy Darma, and Fausto Giunchiglia. 2023. Lexical diversity in kinship across languages and dialects. Frontiers in Psychology, 14.
  • Khalilia et al. (2021a) Hadi Khalilia, Abed Alhakim Freihat, and Fausto Giunchiglia. 2021a. The quality of lexical semantic resources: A survey. In Proceedings of the 4th International Conference on Natural Language and Speech Processing (ICNLSP 2021), pages 117–129.
  • Khalilia et al. (2021b) Hadi Khalilia, Abed Alhakim Freihat, Fausto Giunchiglia, et al. 2021b. The dimensions of lexical semantic resource quality. In Proceedings of the Second International Workshop on NLP Solutions for Under Resourced Languages (NSURL 2021) co-located with ICNLSP 2021, pages 15–21. ACL Anthology.
  • Lehrer (1970) Adrienne Lehrer. 1970. Notes on lexical gaps. Journal of linguistics, 6(2):257–261.
  • Magnini and Cavaglia (2000) Bernardo Magnini and Gabriela Cavaglia. 2000. Integrating subject field codes into Wordnet. In LREC, volume 1413.
  • McCarthy et al. (2019) Arya D McCarthy, Winston Wu, Aaron Mueller, Bill Watson, and David Yarowsky. 2019. Modeling color terminology across thousands of languages. arXiv preprint arXiv:1910.01531.
  • Mihalcea and Moldovan (2001) Rada Mihalcea and Dan I Moldovan. 2001. Ez. Wordnet: Principles for automatic generation of a coarse grained wordnet. In FLAIRS conference, pages 454–458.
  • Navigli (2009) Roberto Navigli. 2009. Word sense disambiguation: A survey. ACM computing surveys (CSUR), 41(2):1–69.
  • Nie (2022) Jian-Yun Nie. 2022. Cross-language information retrieval. Springer Nature.
  • Poibeau (2017) Thierry Poibeau. 2017. Machine translation. MIT Press.
  • Rodríguez et al. (2008) Horacio Rodríguez, David Farwell, Javi Ferreres, Manuel Bertran, Musa Alkhalifa, and Maria Antònia Martí. 2008. Arabic wordnet: Semi-automatic extensions using bayesian inference. In LREC.
  • Suchanek et al. (2008) Fabian M Suchanek, Gjergji Kasneci, and Gerhard Weikum. 2008. Yago: A large ontology from Wikipedia and Wordnet. Journal of Web Semantics, 6(3):203–217.
  • Vossen (1998) Piek Vossen. 1998. A multilingual database with lexical semantic networks. Dordrecht: Kluwer Academic Publishers.

9.   Language Resource References


  • Bhattacharyya (2010) Pushpak Bhattacharyya. 2010. IndoWordNet. In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC’10), Valletta, Malta. European Language Resources Association (ELRA).
  • Elkateb et al. (2006) Sabri Elkateb, William Black, Horacio Rodríguez, Musa Alkhalifa, Piek Vossen, Adam Pease, and Christiane Fellbaum. 2006. Building a WordNet for Arabic. In Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06), pages 29–34. European Language Resources Association.
  • Farreres et al. (2002) Javier Farreres, Horacio Rodríguez, and Karina Gibert. 2002. Semiautomatic creation of taxonomies. In COLING-02: SEMANET: Building and Using Semantic Networks.
  • Miller (1995) George A Miller. 1995. Wordnet: a lexical database for English. Communications of the ACM, 38(11):39–41.
  • Piasecki et al. (2009) Maciej Piasecki, Bernd Broda, and Stanislaw Szpakowicz. 2009. A Wordnet from the ground up. Oficyna Wydawnicza Politechniki Wrocławskiej Wrocław.
  • Regragui et al. (2016) Yasser Regragui, Lahsen Abouenour, Fettoum Krieche, Karim Bouzoubaa, and Paolo Rosso. 2016. Arabic Wordnet: New content and new applications. In Proceedings of the 8th Global WordNet Conference (GWC), pages 333–341.
  • Schuler (2005) Karin Kipper Schuler. 2005. VerbNet: A broad-coverage, comprehensive verb lexicon. University of Pennsylvania.
  • Tufis et al. (2004) Dan Tufis, Dan Cristea, and Sofia Stamou. 2004. Balkanet: Aims, methods, results and perspectives. a general overview. Romanian Journal of Information science and technology, 7(1-2):9–43.
  • Vossen (1999) PJTM Vossen. 1999. Eurowordnet.