Search | arXiv e-print repository

doi 10.18502/kss.v7i3.10419

The Open corpus of the Veps and Karelian languages: overview and applications

Authors: Tatyana Boyko, Nina Zaitseva, Natalia Krizhanovskaya, Andrew Krizhanovsky, Irina Novak, Nataliya Pellinen, Aleksandra Rodionova

Abstract: A growing priority in the study of Baltic-Finnic languages of the Republic of Karelia has been the methods and tools of corpus linguistics. Since 2016, linguists, mathematicians, and programmers at the Karelian Research Centre have been working with the Open Corpus of the Veps and Karelian Languages (VepKar), which is an extension of the Veps Corpus created in 2009. The VepKar corpus comprises texts in Karelian and Veps, multifunctional dictionaries linked to them, and software with an advanced system of search using various criteria of the texts (language, genre, etc.) and numerous linguistic categories (lexical and grammatical search in texts was implemented thanks to the generator of word forms that we created earlier). A corpus of 3000 texts was compiled, texts were uploaded and marked up, the system for classifying texts into languages, dialects, types and genres was introduced, and the word-form generator was created. Future plans include developing a speech module for working with audio recordings and a syntactic tagging module using morphological analysis outputs. Owing to continuous functional advancements in the corpus manager and ongoing VepKar enrichment with new material and text markup, users can handle a wide range of scientific and applied tasks. In creating the universal national VepKar corpus, its developers and managers strive to preserve and exhibit as fully as possible the state of the Veps and Karelian languages in the 19th-21st centuries.

Submitted 8 June, 2022; originally announced June 2022.

Comments: 9 pages, 9 figures, published in the journal

MSC Class: 68T50 ACM Class: H.3.1; H.3.6

Journal ref: KnE Social Sciences. 7 (3). 2022. P. 29-40

arXiv:2205.03608 [pdf, other]

UniMorph 4.0: Universal Morphology

Authors: Khuyagbaatar Batsuren, Omer Goldman, Salam Khalifa, Nizar Habash, Witold Kieraś, Gábor Bella, Brian Leonard, Garrett Nicolai, Kyle Gorman, Yustinus Ghanggo Ate, Maria Ryskina, Sabrina J. Mielke, Elena Budianskaya, Charbel El-Khaissi, Tiago Pimentel, Michael Gasser, William Lane, Mohit Raj, Matt Coler, Jaime Rafael Montoya Samame, Delio Siticonatzi Camaiteri, Benoît Sagot, Esaú Zumaeta Rojas, Didier López Francis, Arturo Oncevay, et al. (71 additional authors not shown)

Abstract: The Universal Morphology (UniMorph) project is a collaborative effort providing broad-coverage instantiated normalized morphological inflection tables for hundreds of diverse world languages. The project comprises two major thrusts: a language-independent feature schema for rich morphological annotation and a type-level resource of annotated data in diverse languages realizing that schema. This paper presents the expansions and improvements made on several fronts over the last couple of years (since McCarthy et al. (2020)). Collaborative efforts by numerous linguists have added 67 new languages, including 30 endangered languages. We have implemented several improvements to the extraction pipeline to tackle some issues, e.g. missing gender and macron information. We have also amended the schema to use a hierarchical structure that is needed for morphological phenomena like multiple-argument agreement and case stacking, while adding some missing morphological features to make the schema more inclusive. In light of the last UniMorph release, we also augmented the database with morpheme segmentation for 16 languages. Lastly, this new release makes a push towards inclusion of derivational morphology in UniMorph by enriching the data and annotation schema with instances representing derivational processes from MorphyNet.

Submitted 19 June, 2022; v1 submitted 7 May, 2022; originally announced May 2022.

Comments: LREC 2022; The first two authors made equal contributions

arXiv:2103.11859 [pdf, other]

Part of speech and gramset tagging algorithms for unknown words based on morphological dictionaries of the Veps and Karelian languages

Authors: Andrew Krizhanovsky, Natalia Krizhanovsky, Irina Novak

Abstract: This research is devoted to the low-resource Veps and Karelian languages. Algorithms for assigning part of speech tags to words and grammatical properties to words are presented in the article. These algorithms use our morphological dictionaries, where the lemma, part of speech and a set of grammatical features (gramset) are known for each word form. The algorithms are based on the analogy hypothesis that words with the same suffixes are likely to have the same inflectional models, the same part of speech and gramset. The accuracy of these algorithms was evaluated and compared. 313 thousand Vepsian and 66 thousand Karelian words were used to verify the accuracy of these algorithms. Special functions were designed to assess the quality of results of the developed algorithms. 92.4% of Vepsian words and 86.8% of Karelian words were assigned a correct part of speech by the developed algorithm. 95.3% of Vepsian words and 90.7% of Karelian words were assigned a correct gramset by our algorithm. Morphological and semantic tagging of texts, which are closely related and inseparable in our corpus processes, are described in the paper.

Submitted 22 March, 2021; originally announced March 2021.

Comments: 17 pages, 4 tables, 7 figures, published in the conference proceeding

MSC Class: 68T50 ACM Class: H.3.1; H.3.6
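
Illustration: the analogy hypothesis in the abstract above (words sharing a suffix tend to share a part of speech) can be sketched as a longest-suffix lookup. This is a minimal sketch and not the authors' algorithm: the function names `build_suffix_index` and `guess_pos`, the majority-vote rule, and the toy English dictionary are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def build_suffix_index(dictionary):
    """Map every suffix of every known word form to a Counter of POS tags.

    `dictionary` is a hypothetical {word_form: pos_tag} mapping standing in
    for a morphological dictionary.
    """
    index = defaultdict(Counter)
    for word, pos in dictionary.items():
        for start in range(len(word)):
            index[word[start:]][pos] += 1
    return index


def guess_pos(word, index):
    """Guess a POS tag for `word` from the longest suffix seen in the index.

    Suffixes are tried from longest to shortest; the first one found in the
    index decides, by majority vote among the known words that end with it.
    Returns None when no suffix of the word is known.
    """
    for start in range(len(word)):
        votes = index.get(word[start:])
        if votes:
            return votes.most_common(1)[0][0]
    return None


# Toy data (assumed, English only for readability).
toy_dictionary = {
    "walking": "VERB",
    "talking": "VERB",
    "kindness": "NOUN",
    "darkness": "NOUN",
}
suffix_index = build_suffix_index(toy_dictionary)

print(guess_pos("jumping", suffix_index))   # shares the suffix "ing"
print(guess_pos("sadness", suffix_index))   # shares the suffix "dness"
```

The same lookup could return a gramset instead of a part of speech by storing gramsets as the dictionary values.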

arXiv:2006.11572 [pdf, other]

SIGMORPHON 2020 Shared Task 0: Typologically Diverse Morphological Inflection

Authors: Ekaterina Vylomova, Jennifer White, Elizabeth Salesky, Sabrina J. Mielke, Shijie Wu, Edoardo Ponti, Rowan Hall Maudslay, Ran Zmigrod, Josef Valvoda, Svetlana Toldova, Francis Tyers, Elena Klyachko, Ilya Yegorov, Natalia Krizhanovsky, Paula Czarnowska, Irene Nikkarinen, Andrew Krizhanovsky, Tiago Pimentel, Lucas Torroba Hennigen, Christo Kirov, Garrett Nicolai, Adina Williams, Antonios Anastasopoulos, Hilaria Cruz, Eleanor Chodroff, et al. (3 additional authors not shown)

Abstract: A broad goal in natural language processing (NLP) is to develop a system that has the capacity to process any natural language. Most systems, however, are developed using data from just one language such as English. The SIGMORPHON 2020 shared task on morphological reinflection aims to investigate systems' ability to generalize across typologically distinct languages, many of which are low resource. Systems were developed using data from 45 languages and just 5 language families, fine-tuned with data from an additional 45 languages and 10 language families (13 in total), and evaluated on all 90 languages. A total of 22 systems (19 neural) from 10 teams were submitted to the task. All four winning systems were neural (two monolingual transformers and two massively multilingual RNN-based models with gated attention). Most teams demonstrate utility of data hallucination and augmentation, ensembles, and multilingual training for low-resource languages. Non-neural learners and manually designed grammars showed competitive and even superior performance on some languages (such as Ingrian, Tajik, Tagalog, Zarma, Lingala), especially with very limited data. Some language families (Afro-Asiatic, Niger-Congo, Turkic) were relatively easy for most systems and achieved over 90% mean accuracy while others were more challenging.

Submitted 14 July, 2020; v1 submitted 20 June, 2020; originally announced June 2020.

Comments: 39 pages, SIGMORPHON

arXiv:2002.00734 [pdf]

Analysis of the quotation corpus of the Russian Wiktionary

Authors: A. Smirnov, T. Levashova, A. Karpov, I. Kipyatkova, A. Ronzhin, A. Krizhanovsky, N. Krizhanovsky

Abstract: The quantitative evaluation of quotations in the Russian Wiktionary was performed using the developed Wiktionary parser. It was found that the number of quotations in the dictionary is growing fast (51.5 thousand in 2011, 62 thousand in 2012). These quotations were extracted and saved in the relational database of a machine-readable dictionary. For this database, tables related to the quotations were designed. A histogram of distribution of quotations of literary works written in different years was built. An attempt was made to explain the characteristics of the histogram by associating it with the years of the most popular and cited (in the Russian Wiktionary) writers of the nineteenth century. It was found that more than one-third of all the quotations (the example sentences) contained in the Russian Wiktionary are taken by the editors of a Wiktionary entry from the Russian National Corpus.

Submitted 20 January, 2020; originally announced February 2020.

Comments: 12 pages, 3 tables, 5 figures, published in the journal (preprint)

MSC Class: 68T50 ACM Class: H.3.3

Journal ref: Research in Computing Science, Vol. 56, pp. 101-112, 2012

arXiv:2001.11285 [pdf, other]

LowResourceEval-2019: a shared task on morphological analysis for low-resource languages

Authors: Elena Klyachko, Alexey Sorokin, Natalia Krizhanovskaya, Andrew Krizhanovsky, Galina Ryazanskaya

Abstract: The paper describes the results of the first shared task on morphological analysis for the languages of Russia, namely, Evenki, Karelian, Selkup, and Veps. For the languages in question, only small-sized corpora are available. The tasks include morphological analysis, word form generation and morpheme segmentation. Four teams participated in the shared task. Most of them use machine-learning approaches, outperforming the existing rule-based ones. The article describes the datasets prepared for the shared tasks and contains analysis of the participants' solutions. Language corpora having different formats were transformed into CONLL-U format. The universal format makes the datasets comparable to other language corpora and facilitates using them in other NLP tasks.

Submitted 30 January, 2020; originally announced January 2020.

Comments: 16 pages, 4 tables, 2 figures, published in the conference proceeding

MSC Class: 68T50

Journal ref: Dialog 2019, Issue 18, Supplementary volume, Pp. 45-62

arXiv:2001.04719 [pdf]

Semi-automatic methods for adding words to the dictionary of VepKar corpus based on inflectional rules extracted from Wiktionary

Authors: Natalia Krizhanovsky, Andrew Krizhanovsky

Abstract: The article describes a technique for using English Wiktionary inflection tables for generating word forms for Veps verbs and nominals in the Open corpus of Veps and Karelian languages. The information concerning Karelian and Veps Wiktionary entries with inflection tables is given. The operating principle of the Wiktionary static and dynamic templates is explained with the use of the jogi (river) dictionary entry as an example. The method of constructing the inflection table in the dictionary of the VepKar corpus according to the data of the dynamic template of the English Wiktionary is presented.

Submitted 14 January, 2020; originally announced January 2020.

Comments: 10 pages, 1 table, 2 figures, published in the conference proceeding https://events.spbu.ru/eventsContent/events/2019/corpora/corp_sborn.pdf#page=211

Journal ref: Corpora 2019, 24-28 June, 2019. Saint-Petersburg. P. 211-217

arXiv:1805.09559 [pdf, other]

doi 10.17076/mat829

WSD algorithm based on a new method of vector-word contexts proximity calculation via epsilon-filtration

Authors: Alexander Kirillov, Natalia Krizhanovsky, Andrew Krizhanovsky

Abstract: The problem of word sense disambiguation (WSD) is considered in the article. A set of synonyms (synsets) and sentences containing these synonyms are given. It is necessary to select the meaning of the word in the sentence automatically. 1285 sentences were tagged by experts, namely, one of the dictionary meanings was selected by experts for target words. To solve the WSD-problem, an algorithm based on a new method of vector-word contexts proximity calculation is proposed. In order to achieve higher accuracy, a preliminary epsilon-filtering of words is performed, both in the sentence and in the set of synonyms. An extensive program of experiments was carried out. Four algorithms are implemented, including a new algorithm. Experiments have shown that in a number of cases the new algorithm shows better results. The developed software and the tagged corpus have an open license and are available online. Wiktionary and Wikisource are used. A brief description of this work can be viewed in slides (https://goo.gl/9ak6Gt). Video lecture in Russian on this research is available online (https://youtu.be/-DLmRkepf58).

Submitted 18 June, 2018; v1 submitted 24 May, 2018; originally announced May 2018.

Comments: 15 pages, 1 table, 15 figures, accepted in the journal Transactions of Karelian Research Centre of the Russian Academy of Sciences

MSC Class: 68T50 ACM Class: I.5.3; H.3.1; H.3.3

Journal ref: Transactions of Karelian Research Centre RAS. No. 7. 2018. P. 149-163

arXiv:1803.01580 [pdf, ps, other]

Calculated attributes of synonym sets

Authors: Andrew Krizhanovsky, Alexander Kirillov

Abstract: The goal of formalization, proposed in this paper, is to bring together, as near as possible, the theoretic linguistic problem of synonym conception and the computer linguistic methods based generally on empirical intuitive unjustified factors. Using the word vector representation we have proposed the geometric approach to mathematical modeling of synonym set (synset). The word embedding is based on the neural networks (Skip-gram, CBOW), developed and realized as word2vec program by T. Mikolov. The standard cosine similarity is used as the distance between word-vectors. Several geometric characteristics of the synset words are introduced: the interior of synset, the synset word rank and centrality. These notions are intended to select the most significant synset words, i.e. the words whose senses are the nearest to the sense of a synset. Some experiments with proposed notions, based on RusVectores resources, are presented. A brief description of this work can be viewed in slides https://goo.gl/K82Fei

Submitted 5 March, 2018; originally announced March 2018.

Comments: 6 pages, 2 tables, 2 figures, preprint

MSC Class: 68T50 ACM Class: I.5.3; H.3.1; H.3.3

arXiv:1109.0732 [pdf]

Multilingual ontology matching based on Wiktionary data accessible via SPARQL endpoint

Authors: Feiyu Lin, Andrew Krizhanovsky

Abstract: Interoperability is a feature required by the Semantic Web. It is provided by the ontology matching methods and algorithms. But now ontologies are presented not only in English, but in other languages as well. It is important to use an automatic translation for obtaining correct matching pairs in multilingual ontology matching. The translation into many languages could be based on the Google Translate API, the Wiktionary database, etc. From the point of view of the balance of presence of many languages, of manually crafted translations, of a huge size of a dictionary, the most promising resource is the Wiktionary. It is a collaborative project working on the same principles as the Wikipedia. The parser of the Wiktionary was developed and the machine-readable dictionary was designed. The data of the machine-readable Wiktionary are stored in a relational database, but with the help of D2R server the database is presented as an RDF store. Thus, it is possible to get lexicographic information (definitions, translations, synonyms) from web service using SPARQL requests. In the case study, the problem entity is a task of multilingual ontology matching based on Wiktionary data accessible via SPARQL endpoint. Ontology matching results obtained using Wiktionary were compared with results based on Google Translate API.

Submitted 26 October, 2011; v1 submitted 4 September, 2011; originally announced September 2011.

Comments: 8 pages, 3 tables, 4 figures, In: Proceedings of the 13th Russian Conference on Digital Libraries RCDL'2011. October 19-22, Voronezh, Russia. - pp. 19-26. (preprint)

MSC Class: 68W25; 90C35 ACM Class: I.7.2; I.7.3; I.7.5; H.3.1; H.3.3

arXiv:1011.1368 [pdf]

Transformation of Wiktionary entry structure into tables and relations in a relational database schema

Authors: A. A. Krizhanovsky

Abstract: This paper addresses the question of automatic data extraction from the Wiktionary, which is a multilingual and multifunctional dictionary. Wiktionary is a collaborative project working on the same principles as the Wikipedia. The Wiktionary entry is a plain text from the text processing point of view. Wiktionary guidelines prescribe the entry layout and rules, which should be followed by editors of the dictionary. The presence of the structure of a Wiktionary article and formatting rules allows transforming the Wiktionary entry structure into tables and relations in a relational database schema, which is a part of a machine-readable dictionary (MRD). The paper describes how the flat text of the Wiktionary entry was extracted, converted, and stored in the specially designed relational database. The MRD contains the definitions, semantic relations, and translations extracted from the English and Russian Wiktionaries. The parser software is released under the open source license agreement (GPL), to facilitate its dissemination, modification and upgrades, to draw researchers and programmers into parsing other Wiktionaries, not only Russian and English.

Submitted 5 November, 2010; originally announced November 2010.

Comments: 10 pages, 7 figures, preprint

MSC Class: 68W25; 90C35 ACM Class: I.7.2; I.7.3; I.7.5; H.3.1; H.3.3

arXiv:1006.5040 [pdf, other]

The comparison of Wiktionary thesauri transformed into the machine-readable format

Authors: A. A. Krizhanovsky

Abstract: Wiktionary is a unique, peculiar, valuable and original resource for natural language processing (NLP). The paper describes an open-source Wiktionary parser: its architecture and requirements followed by a description of Wiktionary features to be taken into account, some open problems of Wiktionary and the parser. The current implementation of the parser extracts the definitions, semantic relations, and translations from English and Russian Wiktionaries. The paper's goal is to interest researchers (1) in using the constructed machine-readable dictionary for different NLP tasks, (2) in extending the software to parse 170 still unused Wiktionaries. The comparison of a number and types of semantic relations, a number of definitions, and a number of translations in the English Wiktionary and the Russian Wiktionary has been carried out. It was found that the number of semantic relations in the English Wiktionary is larger by 1.57 times than in Russian (157 and 100 thousand). But the Russian Wiktionary has more "rich" entries (with a big number of semantic relations), e.g. the number of entries with three or more semantic relations is larger by 1.63 times than in the English Wiktionary. The comparison revealed methodological shortcomings of the Wiktionary.

Submitted 25 June, 2010; originally announced June 2010.

Comments: 23 pages, 3 tables, 6 figures, preprint

MSC Class: 68W25; 90C35 ACM Class: I.7.2; I.7.3; I.7.5; H.3.1; H.3.3

arXiv:0907.2209 [pdf]

Related terms search based on WordNet / Wiktionary and its application in Ontology Matching

Authors: A. A. Krizhanovsky, Feiyu Lin

Abstract: A set of ontology matching algorithms (for finding correspondences between concepts) is based on a thesaurus that provides the source data for the semantic distance calculations. In this wiki era, new resources may spring up and improve this kind of semantic search. In the paper a solution of this task based on Russian Wiktionary is compared to WordNet based algorithms. Metrics are estimated using the test collection, containing 353 English word pairs with a relatedness score assigned by human evaluators. The experiment shows that the proposed method is capable in principle of calculating a semantic distance between a pair of words in any language presented in Russian Wiktionary. The calculation of the Wiktionary based metric required the development of the open-source Wiktionary parser software.

Submitted 12 October, 2009; v1 submitted 13 July, 2009; originally announced July 2009.

Comments: 7 pages, 2 tables, 3 figures; In: RCDL 2009. September 17-21, Petrozavodsk, Russia. - pp. 363-369

ACM Class: I.7.2; I.7.3; I.7.5; H.3.1; H.3.3

arXiv:0808.1753 [pdf]

Index wiki database: design and experiments

Authors: A. A. Krizhanovsky

Abstract: With the fantastic growth of Internet usage, information search in documents of a special type called a "wiki page" that is written using a simple markup language, has become an important problem. This paper describes the software architectural model for indexing wiki texts in three languages (Russian, English, and German) and the interaction between the software components (GATE, Lemmatizer, and Synarcher). The inverted file index database was designed using visual tool DBDesigner. The rules for parsing Wikipedia texts are illustrated by examples. Two index databases of Russian Wikipedia (RW) and Simple English Wikipedia (SEW) are built and compared. The size of RW is an order of magnitude larger than that of SEW (number of words, lexemes), though the growth rate of number of pages in SEW was found to be 14% higher than in Russian, and the rate of acquisition of new words in SEW lexicon was 7% higher during a period of five months (from September 2007 to February 2008). Zipf's law was tested with both Russian and Simple Wikipedias. The entire source code of the indexing software and the generated index databases are freely available under GPL (GNU General Public License).

Submitted 23 September, 2008; v1 submitted 12 August, 2008; originally announced August 2008.

Comments: 18 pages, 4 tables, 4 figures; FLINS'08, Corpus Linguistics'08, AIS/CAD'08; v2: table 3 changed

ACM Class: I.7.2; I.7.3; I.7.5; H.3.1; H.3.3

arXiv:0804.2354 [pdf]

Information filtering based on wiki index database

Authors: A. V. Smirnov, A. A. Krizhanovsky

Abstract: In this paper we present a profile-based approach to information filtering by an analysis of the content of text documents. The Wikipedia index database is created and used to automatically generate the user profile from the user document collection. The problem-oriented Wikipedia subcorpora are created (using knowledge extracted from the user profile) for each topic of user interests. The index databases of these subcorpora are applied to filtering information flow (e.g., mails, news). Thus, the analyzed texts are classified into several topics explicitly presented in the user profile. The paper concentrates on the indexing part of the approach. The architecture of an application implementing the Wikipedia indexing is described. The indexing method is evaluated using the Russian and Simple English Wikipedia.

Submitted 8 May, 2008; v1 submitted 15 April, 2008; originally announced April 2008.

Comments: 9 pages, 1 table, 2 figures, 8th International FLINS Conference on Computational Intelligence in Decision and Control, Madrid, Spain, September 21-24, 2008; v2: typo

ACM Class: I.7.2; I.7.3; I.7.5; H.3.1; H.3.3

arXiv:0710.0169 [pdf]

Evaluation experiments on related terms search in Wikipedia: Information Content and Adapted HITS (In Russian)

Authors: A. A. Krizhanovsky

Abstract: The classification of metrics and algorithms for searching for related terms via WordNet, Roget's Thesaurus, and Wikipedia was extended to include the adapted HITS algorithm. Evaluation experiments on Information Content and the adapted HITS algorithm are described. The test collection of Russian word pairs with human-assigned similarity judgments is proposed. ----- Klassifikacija metrik i algoritmov poiska semanticheski blizkih slov v tezaurusah WordNet, Rozhe i jenciklopedii Vikipedija rasshirena adaptirovannym HITS algoritmom. S pomow'ju jeksperimentov v Vikipedii oceneny metrika Information Content i adaptirovannyj algoritm HITS. Predlozhen resurs dlja ocenki semanticheskoj blizosti russkih slov.

Submitted 16 January, 2008; v1 submitted 1 October, 2007; originally announced October 2007.

Comments: 10 pages, 1 figure, 3 tables, in Russian, short version of the paper to be published in Proceedings of the Wiki-Conference 2007, Russia, St. Petersburg, October 27-28. http://tinyurl.com/2czd6e ; v3: +figure; v4: typo in Table 3; v5: +desc (res_hypo formula); v6: typo

ACM Class: H.3.1; H.3.3; H.4.3; G.2.2

arXiv:cs/0610058 [pdf]

Context-sensitive access to e-document corpus

Authors: A. V. Smirnov, T. V. Levashova, M. P. Pashkin, N. G. Shilov, A. A. Krizhanovsky, A. M. Kashevnik, A. S. Komarova

Abstract: The methodology of context-sensitive access to e-documents considers context as a problem model based on the knowledge extracted from the application domain, and presented in the form of application ontology. Efficient access to information in text form is needed. Wiki resources, as a modern text format, provide a huge number of texts in a semi-formalized structure. At the first stage of the methodology, documents are indexed against the ontology representing macro-situation. The indexing method uses a topic tree as a middle layer between documents and the application ontology. At the second stage documents relevant to the current situation (the abstract and operational contexts) are identified and sorted by degree of relevance. Abstract context is a problem-oriented ontology-based model. Operational context is an instantiation of the abstract context with data provided by the information sources. The following parts of the methodology are described: (i) metrics for measuring similarity of e-documents to ontology; (ii) a document index storing results of indexing of e-documents against the ontology; (iii) a method for identification of relevant e-documents based on semantic similarity measures. Wikipedia (wiki resource) is used as a corpus of e-documents for approach evaluation in a case study. Text categorization, the presence of metadata, and the existence of many articles related to different topics characterize the corpus.

Submitted 11 October, 2006; originally announced October 2006.

Comments: 9 pages, 1 figure, short version of this paper was presented at the International Conference Corpus Linguistics 2006. October 10-14, St. Petersburg, Russia

ACM Class: H.3.1; H.3.3; H.4.3; G.2.2

arXiv:cs/0606128 [pdf]

Automatic forming lists of semantically related terms based on texts rating in the corpus with hyperlinks and categories (In Russian)

Authors: A. Krizhanovsky

Abstract: HITS adapted algorithm for synonym search, the program architecture, and the program work evaluation with test examples are presented in the paper. Synarcher program for synonym (and related terms) search in the text corpus of special structure (Wikipedia) was developed. The results of search are presented in the form of a graph. It is possible to explore the graph and search graph elements interactively. The proposed algorithm could be applied to the search request extending and for synonym dictionary forming.

Submitted 30 June, 2006; originally announced June 2006.

Comments: 6 pages, 1 figure, in Russian, PDF, for other formats see http://whinger.narod.ru/paper/index.html

ACM Class: H.3.1; H.3.3; H.4.3; G.2.2

arXiv:cs/0606097 [pdf, ps, other]

Synonym search in Wikipedia: Synarcher

Authors: A. Krizhanovsky

Abstract: The program Synarcher for synonym (and related terms) search in the text corpus of special structure (Wikipedia) was developed. The results of the search are presented in the form of graph. It is possible to explore the graph and search for graph elements interactively. Adapted HITS algorithm for synonym search, program architecture, and program work evaluation with test examples are presented in the paper. The proposed algorithm can be applied to a query expansion by synonyms (in a search engine) and a synonym dictionary forming.

Submitted 23 June, 2006; v1 submitted 22 June, 2006; originally announced June 2006.

Comments: 4 pages, 2 figures, Synarcher program is available at http://synarcher.sourceforge.net

ACM Class: H.3.1; H.3.3; H.4.3; G.2.2

arXiv:cs/0501077 [pdf]

Ontology-Based Users & Requests Clustering in Customer Service Management System

Authors: Alexander Smirnov, Mikhail Pashkin, Nikolai Chilov, Tatiana Levashova, Andrew Krizhanovsky, Alexey Kashevnik

Abstract: Customer Service Management is one of major business activities to better serve company customers through the introduction of reliable processes and procedures. Today this kind of activities is implemented through e-services to directly involve customers into business processes. Traditionally Customer Service Management involves application of data mining techniques to discover usage patterns from the company knowledge memory. Hence grouping of customers/requests to clusters is one of major technique to improve the level of company customization. The goal of this paper is to present an efficient for implementation approach for clustering users and their requests. The approach uses ontology as knowledge representation model to improve the semantic interoperability between units of the company and customers. Some fragments of the approach tested in an industrial company are also presented in the paper.

Submitted 27 May, 2005; v1 submitted 26 January, 2005; originally announced January 2005.

Comments: 15 pages, 4 figures, published in Lecture Notes in Computer Science

ACM Class: H.3.3

Journal ref: Smirnov A., Pashkin M., Chilov N., Levashova T., Krizhanovsky A., Kashevnik A. 2005. Ontology-Based Users and Requests Clustering in Customer Service Management System. Springer-Verlag GmbH, Lecture Notes in Computer Science, 3505: 231-246

Showing 1–20 of 20 results for author: Krizhanovsky, A