Search | arXiv e-print repository

arXiv:2103.11859 [pdf, other]

Part of speech and gramset tagging algorithms for unknown words based on morphological dictionaries of the Veps and Karelian languages

Authors: Andrew Krizhanovsky, Natalia Krizhanovsky, Irina Novak

Abstract: This research is devoted to the low-resource Veps and Karelian languages. Algorithms for assigning part of speech tags to words and grammatical properties to words are presented in the article. These algorithms use our morphological dictionaries, where the lemma, part of speech and a set of grammatical features (gramset) are known for each word form. The algorithms are based on the analogy hypothesis that words with the same suffixes are likely to have the same inflectional models, the same part of speech and gramset. The accuracy of these algorithms was evaluated and compared. 313 thousand Vepsian and 66 thousand Karelian words were used to verify the accuracy of these algorithms. Special functions were designed to assess the quality of results of the developed algorithms. 92.4% of Vepsian words and 86.8% of Karelian words were assigned a correct part of speech by the developed algorithm. 95.3% of Vepsian words and 90.7% of Karelian words were assigned a correct gramset by our algorithm. Morphological and semantic tagging of texts, which are closely related and inseparable in our corpus processes, are described in the paper.

Submitted 22 March, 2021; originally announced March 2021.

Comments: 17 pages, 4 tables, 7 figures, published in the conference proceedings

MSC Class: 68T50 ACM Class: H.3.1; H.3.6

arXiv:2006.11572 [pdf, other]

SIGMORPHON 2020 Shared Task 0: Typologically Diverse Morphological Inflection

Authors: Ekaterina Vylomova, Jennifer White, Elizabeth Salesky, Sabrina J. Mielke, Shijie Wu, Edoardo Ponti, Rowan Hall Maudslay, Ran Zmigrod, Josef Valvoda, Svetlana Toldova, Francis Tyers, Elena Klyachko, Ilya Yegorov, Natalia Krizhanovsky, Paula Czarnowska, Irene Nikkarinen, Andrew Krizhanovsky, Tiago Pimentel, Lucas Torroba Hennigen, Christo Kirov, Garrett Nicolai, Adina Williams, Antonios Anastasopoulos, Hilaria Cruz, Eleanor Chodroff, et al. (3 additional authors not shown)

Abstract: A broad goal in natural language processing (NLP) is to develop a system that has the capacity to process any natural language. Most systems, however, are developed using data from just one language such as English. The SIGMORPHON 2020 shared task on morphological reinflection aims to investigate systems' ability to generalize across typologically distinct languages, many of which are low resource. Systems were developed using data from 45 languages and just 5 language families, fine-tuned with data from an additional 45 languages and 10 language families (13 in total), and evaluated on all 90 languages. A total of 22 systems (19 neural) from 10 teams were submitted to the task. All four winning systems were neural (two monolingual transformers and two massively multilingual RNN-based models with gated attention). Most teams demonstrate utility of data hallucination and augmentation, ensembles, and multilingual training for low-resource languages. Non-neural learners and manually designed grammars showed competitive and even superior performance on some languages (such as Ingrian, Tajik, Tagalog, Zarma, Lingala), especially with very limited data. Some language families (Afro-Asiatic, Niger-Congo, Turkic) were relatively easy for most systems and achieved over 90% mean accuracy while others were more challenging.

Submitted 14 July, 2020; v1 submitted 20 June, 2020; originally announced June 2020.

Comments: 39 pages, SIGMORPHON

arXiv:2002.00734 [pdf]

Analysis of the quotation corpus of the Russian Wiktionary

Authors: A. Smirnov, T. Levashova, A. Karpov, I. Kipyatkova, A. Ronzhin, A. Krizhanovsky, N. Krizhanovsky

Abstract: The quantitative evaluation of quotations in the Russian Wiktionary was performed using the developed Wiktionary parser. It was found that the number of quotations in the dictionary is growing fast (51.5 thousand in 2011, 62 thousand in 2012). These quotations were extracted and saved in the relational database of a machine-readable dictionary. For this database, tables related to the quotations were designed. A histogram of the distribution of quotations from literary works written in different years was built. An attempt was made to explain the characteristics of the histogram by associating it with the years of the most popular and cited (in the Russian Wiktionary) writers of the nineteenth century. It was found that more than one-third of all the quotations (the example sentences) contained in the Russian Wiktionary are taken by the editors of a Wiktionary entry from the Russian National Corpus.

Submitted 20 January, 2020; originally announced February 2020.

Comments: 12 pages, 3 tables, 5 figures, published in the journal (preprint)

MSC Class: 68T50 ACM Class: H.3.3

Journal ref: Research in Computing Science, Vol. 56, pp. 101-112, 2012

arXiv:2001.04719 [pdf]

Semi-automatic methods for adding words to the dictionary of VepKar corpus based on inflectional rules extracted from Wiktionary

Authors: Natalia Krizhanovsky, Andrew Krizhanovsky

Abstract: The article describes a technique for using English Wiktionary inflection tables for generating word forms for Veps verbs and nominals in the Open corpus of Veps and Karelian languages. The information concerning Karelian and Veps Wiktionary entries with inflection tables is given. The operating principle of the Wiktionary static and dynamic templates is explained with the use of the jogi (river) dictionary entry as an example. The method of constructing the inflection table in the dictionary of the VepKar corpus according to the data of the dynamic template of the English Wiktionary is presented.

Submitted 14 January, 2020; originally announced January 2020.

Comments: 10 pages, 1 table, 2 figures, published in the conference proceedings https://events.spbu.ru/eventsContent/events/2019/corpora/corp_sborn.pdf#page=211

Journal ref: Corpora 2019, 24-28 June, 2019. Saint-Petersburg. P. 211-217

arXiv:1805.09559 [pdf, other]

doi 10.17076/mat829

WSD algorithm based on a new method of vector-word contexts proximity calculation via epsilon-filtration

Authors: Alexander Kirillov, Natalia Krizhanovsky, Andrew Krizhanovsky

Abstract: The problem of word sense disambiguation (WSD) is considered in the article. A set of synonyms (synsets) and sentences containing these synonyms are given. It is necessary to select the meaning of the word in the sentence automatically. 1285 sentences were tagged by experts, namely, one of the dictionary meanings was selected by experts for target words. To solve the WSD problem, an algorithm based on a new method of vector-word contexts proximity calculation is proposed. In order to achieve higher accuracy, a preliminary epsilon-filtering of words is performed, both in the sentence and in the set of synonyms. An extensive program of experiments was carried out. Four algorithms are implemented, including a new algorithm. Experiments have shown that in a number of cases the new algorithm shows better results. The developed software and the tagged corpus have an open license and are available online. Wiktionary and Wikisource are used. A brief description of this work can be viewed in slides (https://goo.gl/9ak6Gt). A video lecture in Russian on this research is available online (https://youtu.be/-DLmRkepf58).

Submitted 18 June, 2018; v1 submitted 24 May, 2018; originally announced May 2018.

Comments: 15 pages, 1 table, 15 figures, accepted in the journal Transactions of Karelian Research Centre of the Russian Academy of Sciences

MSC Class: 68T50 ACM Class: I.5.3; H.3.1; H.3.3

Journal ref: Transactions of Karelian Research Centre RAS. No. 7. 2018. P. 149-163

Showing 1–5 of 5 results for author: Krizhanovsky, N