-
The Open corpus of the Veps and Karelian languages: overview and applications
Authors:
Tatyana Boyko,
Nina Zaitseva,
Natalia Krizhanovskaya,
Andrew Krizhanovsky,
Irina Novak,
Nataliya Pellinen,
Aleksandra Rodionova
Abstract:
A growing priority in the study of Baltic-Finnic languages of the Republic of Karelia has been the methods and tools of corpus linguistics. Since 2016, linguists, mathematicians, and programmers at the Karelian Research Centre have been working with the Open Corpus of the Veps and Karelian Languages (VepKar), which is an extension of the Veps Corpus created in 2009. The VepKar corpus comprises texts in Karelian and Veps, multifunctional dictionaries linked to them, and software with an advanced system of search using various criteria of the texts (language, genre, etc.) and numerous linguistic categories (lexical and grammatical search in texts was implemented thanks to the generator of word forms that we created earlier). A corpus of 3000 texts was compiled, texts were uploaded and marked up, the system for classifying texts into languages, dialects, types and genres was introduced, and the word-form generator was created. Future plans include developing a speech module for working with audio recordings and a syntactic tagging module using morphological analysis outputs. Owing to continuous functional advancements in the corpus manager and ongoing VepKar enrichment with new material and text markup, users can handle a wide range of scientific and applied tasks. In creating the universal national VepKar corpus, its developers and managers strive to preserve and exhibit as fully as possible the state of the Veps and Karelian languages in the 19th-21st centuries.
Submitted 8 June, 2022;
originally announced June 2022.
-
UniMorph 4.0: Universal Morphology
Authors:
Khuyagbaatar Batsuren,
Omer Goldman,
Salam Khalifa,
Nizar Habash,
Witold Kieraś,
Gábor Bella,
Brian Leonard,
Garrett Nicolai,
Kyle Gorman,
Yustinus Ghanggo Ate,
Maria Ryskina,
Sabrina J. Mielke,
Elena Budianskaya,
Charbel El-Khaissi,
Tiago Pimentel,
Michael Gasser,
William Lane,
Mohit Raj,
Matt Coler,
Jaime Rafael Montoya Samame,
Delio Siticonatzi Camaiteri,
Benoît Sagot,
Esaú Zumaeta Rojas,
Didier López Francis,
Arturo Oncevay
, et al. (71 additional authors not shown)
Abstract:
The Universal Morphology (UniMorph) project is a collaborative effort providing broad-coverage instantiated normalized morphological inflection tables for hundreds of diverse world languages. The project comprises two major thrusts: a language-independent feature schema for rich morphological annotation and a type-level resource of annotated data in diverse languages realizing that schema. This paper presents the expansions and improvements made on several fronts over the last couple of years (since McCarthy et al. (2020)). Collaborative efforts by numerous linguists have added 67 new languages, including 30 endangered languages. We have implemented several improvements to the extraction pipeline to tackle some issues, e.g. missing gender and macron information. We have also amended the schema to use a hierarchical structure that is needed for morphological phenomena like multiple-argument agreement and case stacking, while adding some missing morphological features to make the schema more inclusive. In light of the last UniMorph release, we also augmented the database with morpheme segmentation for 16 languages. Lastly, this new release makes a push towards inclusion of derivational morphology in UniMorph by enriching the data and annotation schema with instances representing derivational processes from MorphyNet.
Submitted 19 June, 2022; v1 submitted 7 May, 2022;
originally announced May 2022.
-
Part of speech and gramset tagging algorithms for unknown words based on morphological dictionaries of the Veps and Karelian languages
Authors:
Andrew Krizhanovsky,
Natalia Krizhanovsky,
Irina Novak
Abstract:
This research is devoted to the low-resource Veps and Karelian languages. The article presents algorithms for assigning part-of-speech tags and grammatical properties to words. These algorithms use our morphological dictionaries, where the lemma, part of speech and a set of grammatical features (gramset) are known for each word form. The algorithms are based on the analogy hypothesis that words with the same suffixes are likely to have the same inflectional models, the same part of speech and gramset. The accuracy of these algorithms was evaluated and compared. 313 thousand Vepsian and 66 thousand Karelian words were used to verify the accuracy of these algorithms. Special functions were designed to assess the quality of the results of the developed algorithms. 92.4% of Vepsian words and 86.8% of Karelian words were assigned a correct part of speech by the developed algorithm. 95.3% of Vepsian words and 90.7% of Karelian words were assigned a correct gramset by our algorithm. Morphological and semantic tagging of texts, which are closely related and inseparable in our corpus processes, are described in the paper.
Submitted 22 March, 2021;
originally announced March 2021.
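The analogy hypothesis in the abstract above (words sharing a suffix tend to share a part of speech) can be illustrated with a minimal sketch. This is not the authors' algorithm: the function names, the `max_suffix_len` parameter, the longest-suffix-first strategy, and the toy English dictionary are all assumptions made for illustration, and no VepKar data is used.

```python
from collections import Counter, defaultdict


def build_suffix_index(dictionary, max_suffix_len=5):
    """Map each suffix (1..max_suffix_len chars) to a Counter of POS tags."""
    index = defaultdict(Counter)
    for word_form, pos in dictionary:
        for k in range(1, min(max_suffix_len, len(word_form)) + 1):
            index[word_form[-k:]][pos] += 1
    return index


def guess_pos(word, index, max_suffix_len=5):
    """Return the most frequent POS among known words sharing the longest suffix."""
    for k in range(min(max_suffix_len, len(word)), 0, -1):
        counts = index.get(word[-k:])
        if counts:
            return counts.most_common(1)[0][0]
    return None  # no known word shares even a one-character suffix


# Toy, made-up dictionary of (word form, POS) pairs; hypothetical data.
toy_dictionary = [
    ("walking", "VERB"),
    ("talking", "VERB"),
    ("jumped", "VERB"),
    ("kindness", "NOUN"),
    ("darkness", "NOUN"),
]
index = build_suffix_index(toy_dictionary)
print(guess_pos("singing", index))  # shares the suffix "ing" with known verbs
print(guess_pos("sadness", index))  # shares the suffix "dness" with a known noun
```

Trying the longest suffix first is one plausible design choice: a longer shared suffix is stronger evidence of a shared inflectional model than a single final letter.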
-
SIGMORPHON 2020 Shared Task 0: Typologically Diverse Morphological Inflection
Authors:
Ekaterina Vylomova,
Jennifer White,
Elizabeth Salesky,
Sabrina J. Mielke,
Shijie Wu,
Edoardo Ponti,
Rowan Hall Maudslay,
Ran Zmigrod,
Josef Valvoda,
Svetlana Toldova,
Francis Tyers,
Elena Klyachko,
Ilya Yegorov,
Natalia Krizhanovsky,
Paula Czarnowska,
Irene Nikkarinen,
Andrew Krizhanovsky,
Tiago Pimentel,
Lucas Torroba Hennigen,
Christo Kirov,
Garrett Nicolai,
Adina Williams,
Antonios Anastasopoulos,
Hilaria Cruz,
Eleanor Chodroff
, et al. (3 additional authors not shown)
Abstract:
A broad goal in natural language processing (NLP) is to develop a system that has the capacity to process any natural language. Most systems, however, are developed using data from just one language such as English. The SIGMORPHON 2020 shared task on morphological reinflection aims to investigate systems' ability to generalize across typologically distinct languages, many of which are low resource. Systems were developed using data from 45 languages and just 5 language families, fine-tuned with data from an additional 45 languages and 10 language families (13 in total), and evaluated on all 90 languages. A total of 22 systems (19 neural) from 10 teams were submitted to the task. All four winning systems were neural (two monolingual transformers and two massively multilingual RNN-based models with gated attention). Most teams demonstrate utility of data hallucination and augmentation, ensembles, and multilingual training for low-resource languages. Non-neural learners and manually designed grammars showed competitive and even superior performance on some languages (such as Ingrian, Tajik, Tagalog, Zarma, Lingala), especially with very limited data. Some language families (Afro-Asiatic, Niger-Congo, Turkic) were relatively easy for most systems and achieved over 90% mean accuracy while others were more challenging.
Submitted 14 July, 2020; v1 submitted 20 June, 2020;
originally announced June 2020.
-
Analysis of the quotation corpus of the Russian Wiktionary
Authors:
A. Smirnov,
T. Levashova,
A. Karpov,
I. Kipyatkova,
A. Ronzhin,
A. Krizhanovsky,
N. Krizhanovsky
Abstract:
The quantitative evaluation of quotations in the Russian Wiktionary was performed using the developed Wiktionary parser. It was found that the number of quotations in the dictionary is growing fast (51.5 thousand in 2011, 62 thousand in 2012). These quotations were extracted and saved in the relational database of a machine-readable dictionary. For this database, tables related to the quotations were designed. A histogram of the distribution of quotations from literary works written in different years was built. An attempt was made to explain the characteristics of the histogram by associating it with the years of the most popular and cited (in the Russian Wiktionary) writers of the nineteenth century. It was found that more than one-third of all the quotations (the example sentences) contained in the Russian Wiktionary are taken by the editors of a Wiktionary entry from the Russian National Corpus.
Submitted 20 January, 2020;
originally announced February 2020.
-
LowResourceEval-2019: a shared task on morphological analysis for low-resource languages
Authors:
Elena Klyachko,
Alexey Sorokin,
Natalia Krizhanovskaya,
Andrew Krizhanovsky,
Galina Ryazanskaya
Abstract:
The paper describes the results of the first shared task on morphological analysis for the languages of Russia, namely, Evenki, Karelian, Selkup, and Veps. For the languages in question, only small-sized corpora are available. The tasks include morphological analysis, word form generation and morpheme segmentation. Four teams participated in the shared task. Most of them use machine-learning approaches, outperforming the existing rule-based ones. The article describes the datasets prepared for the shared tasks and contains an analysis of the participants' solutions. Language corpora having different formats were transformed into CONLL-U format. The universal format makes the datasets comparable to other language corpora and facilitates using them in other NLP tasks.
Submitted 30 January, 2020;
originally announced January 2020.
-
Semi-automatic methods for adding words to the dictionary of VepKar corpus based on inflectional rules extracted from Wiktionary
Authors:
Natalia Krizhanovsky,
Andrew Krizhanovsky
Abstract:
The article describes a technique for using English Wiktionary inflection tables for generating word forms for Veps verbs and nominals in the Open corpus of Veps and Karelian languages. The information concerning Karelian and Veps Wiktionary entries with inflection tables is given. The operating principle of the Wiktionary static and dynamic templates is explained with the use of the jogi (river) dictionary entry as an example. The method of constructing the inflection table in the dictionary of the VepKar corpus according to the data of the dynamic template of the English Wiktionary is presented.
Submitted 14 January, 2020;
originally announced January 2020.
-
WSD algorithm based on a new method of vector-word contexts proximity calculation via epsilon-filtration
Authors:
Alexander Kirillov,
Natalia Krizhanovsky,
Andrew Krizhanovsky
Abstract:
The problem of word sense disambiguation (WSD) is considered in the article. Given a set of synonyms (synsets) and sentences containing these synonyms, the task is to select the meaning of the word in the sentence automatically. 1285 sentences were tagged by experts, namely, one of the dictionary meanings was selected by experts for target words. To solve the WSD problem, an algorithm based on a new method of vector-word contexts proximity calculation is proposed. In order to achieve higher accuracy, a preliminary epsilon-filtering of words is performed, both in the sentence and in the set of synonyms. An extensive program of experiments was carried out. Four algorithms are implemented, including the new algorithm. Experiments have shown that in a number of cases the new algorithm shows better results. The developed software and the tagged corpus have an open license and are available online. Wiktionary and Wikisource are used. A brief description of this work can be viewed in slides (https://goo.gl/9ak6Gt). A video lecture in Russian on this research is available online (https://youtu.be/-DLmRkepf58).
Submitted 18 June, 2018; v1 submitted 24 May, 2018;
originally announced May 2018.
-
Calculated attributes of synonym sets
Authors:
Andrew Krizhanovsky,
Alexander Kirillov
Abstract:
The goal of formalization, proposed in this paper, is to bring together, as near as possible, the theoretic linguistic problem of synonym conception and the computer linguistic methods based generally on empirical intuitive unjustified factors. Using the word vector representation, we have proposed a geometric approach to the mathematical modeling of a synonym set (synset). The word embedding is based on the neural networks (Skip-gram, CBOW) developed and implemented as the word2vec program by T. Mikolov. The standard cosine similarity is used as the distance between word vectors. Several geometric characteristics of the synset words are introduced: the interior of a synset, the synset word rank and centrality. These notions are intended to select the most significant synset words, i.e. the words whose senses are the nearest to the sense of a synset. Some experiments with the proposed notions, based on RusVectores resources, are presented. A brief description of this work can be viewed in slides https://goo.gl/K82Fei
Submitted 5 March, 2018;
originally announced March 2018.
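The abstract above names cosine similarity between word vectors as its basic measure. The sketch below shows that measure together with a simple stand-in for "most significant synset word": ranking words by their similarity to the synset's mean vector. This ranking is an assumption for illustration only; it is not the paper's definition of rank, centrality, or interior, and the two-dimensional vectors are made up rather than taken from word2vec or RusVectores.

```python
import math


def cosine(u, v):
    """Standard cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return tuple(sum(component) / n for component in zip(*vectors))


def rank_by_centroid_similarity(synset):
    """Sort synset words by descending cosine similarity to the mean vector."""
    center = centroid(list(synset.values()))
    return sorted(synset, key=lambda word: -cosine(synset[word], center))


# Hypothetical 2-D embeddings; real embeddings have hundreds of dimensions.
toy_synset = {
    "big": (1.0, 0.0),
    "large": (0.9, 0.1),
    "huge": (0.6, 0.8),
}
ranking = rank_by_centroid_similarity(toy_synset)
print(ranking)
```

In this toy data, the word whose direction lies closest to the mean direction is ranked first, and the outlying vector is ranked last.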
-
Multilingual ontology matching based on Wiktionary data accessible via SPARQL endpoint
Authors:
Feiyu Lin,
Andrew Krizhanovsky
Abstract:
Interoperability is a feature required by the Semantic Web. It is provided by the ontology matching methods and algorithms. But now ontologies are presented not only in English, but in other languages as well. It is important to use an automatic translation for obtaining correct matching pairs in multilingual ontology matching. The translation into many languages could be based on the Google Translate API, the Wiktionary database, etc. From the point of view of the balance of presence of many languages, of manually crafted translations, of a huge size of a dictionary, the most promising resource is the Wiktionary. It is a collaborative project working on the same principles as the Wikipedia. The parser of the Wiktionary was developed and the machine-readable dictionary was designed. The data of the machine-readable Wiktionary are stored in a relational database, but with the help of D2R server the database is presented as an RDF store. Thus, it is possible to get lexicographic information (definitions, translations, synonyms) from web service using SPARQL requests. In the case study, the problem entity is a task of multilingual ontology matching based on Wiktionary data accessible via SPARQL endpoint. Ontology matching results obtained using Wiktionary were compared with results based on Google Translate API.
Submitted 26 October, 2011; v1 submitted 4 September, 2011;
originally announced September 2011.
-
Transformation of Wiktionary entry structure into tables and relations in a relational database schema
Authors:
A. A. Krizhanovsky
Abstract:
This paper addresses the question of automatic data extraction from the Wiktionary, which is a multilingual and multifunctional dictionary. Wiktionary is a collaborative project working on the same principles as the Wikipedia. The Wiktionary entry is a plain text from the text processing point of view. Wiktionary guidelines prescribe the entry layout and rules, which should be followed by editors of the dictionary. The presence of the structure of a Wiktionary article and formatting rules allows transforming the Wiktionary entry structure into tables and relations in a relational database schema, which is a part of a machine-readable dictionary (MRD). The paper describes how the flat text of the Wiktionary entry was extracted, converted, and stored in the specially designed relational database. The MRD contains the definitions, semantic relations, and translations extracted from the English and Russian Wiktionaries. The parser software is released under the open source license agreement (GPL), to facilitate its dissemination, modification and upgrades, to draw researchers and programmers into parsing other Wiktionaries, not only Russian and English.
Submitted 5 November, 2010;
originally announced November 2010.
-
The comparison of Wiktionary thesauri transformed into the machine-readable format
Authors:
A. A. Krizhanovsky
Abstract:
Wiktionary is a unique, peculiar, valuable and original resource for natural language processing (NLP). The paper describes an open-source Wiktionary parser: its architecture and requirements followed by a description of Wiktionary features to be taken into account, some open problems of Wiktionary and the parser. The current implementation of the parser extracts the definitions, semantic relations, and translations from English and Russian Wiktionaries. The paper's goal is to interest researchers (1) in using the constructed machine-readable dictionary for different NLP tasks, (2) in extending the software to parse 170 still unused Wiktionaries. The number and types of semantic relations, the number of definitions, and the number of translations in the English Wiktionary and the Russian Wiktionary were compared. It was found that the number of semantic relations in the English Wiktionary is 1.57 times larger than in the Russian one (157 thousand and 100 thousand). But the Russian Wiktionary has more "rich" entries (with a large number of semantic relations), e.g. the number of entries with three or more semantic relations is 1.63 times larger than in the English Wiktionary. The comparison also revealed methodological shortcomings of the Wiktionary.
Submitted 25 June, 2010;
originally announced June 2010.
-
Related terms search based on WordNet / Wiktionary and its application in Ontology Matching
Authors:
A. A. Krizhanovsky,
Feiyu Lin
Abstract:
A set of ontology matching algorithms (for finding correspondences between concepts) is based on a thesaurus that provides the source data for the semantic distance calculations. In this wiki era, new resources may spring up and improve this kind of semantic search. In the paper, a solution to this task based on Russian Wiktionary is compared to WordNet-based algorithms. Metrics are estimated using the test collection, containing 353 English word pairs with a relatedness score assigned by human evaluators. The experiment shows that the proposed method is capable in principle of calculating a semantic distance between a pair of words in any language presented in Russian Wiktionary. The calculation of the Wiktionary-based metric required the development of the open-source Wiktionary parser software.
Submitted 12 October, 2009; v1 submitted 13 July, 2009;
originally announced July 2009.
-
Index wiki database: design and experiments
Authors:
A. A. Krizhanovsky
Abstract:
With the fantastic growth of Internet usage, information search in documents of a special type called a "wiki page", which is written using a simple markup language, has become an important problem. This paper describes the software architectural model for indexing wiki texts in three languages (Russian, English, and German) and the interaction between the software components (GATE, Lemmatizer, and Synarcher). The inverted file index database was designed using the visual tool DBDesigner. The rules for parsing Wikipedia texts are illustrated by examples. Two index databases of Russian Wikipedia (RW) and Simple English Wikipedia (SEW) are built and compared. RW is an order of magnitude larger than SEW (in number of words and lexemes), though the growth rate of the number of pages in SEW was found to be 14% higher than in Russian, and the rate of acquisition of new words in the SEW lexicon was 7% higher during a period of five months (from September 2007 to February 2008). Zipf's law was tested with both Russian and Simple English Wikipedias. The entire source code of the indexing software and the generated index databases are freely available under GPL (GNU General Public License).
Submitted 23 September, 2008; v1 submitted 12 August, 2008;
originally announced August 2008.
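The inverted file index mentioned in the abstract above can be sketched in a few lines. This is a generic in-memory illustration, not the paper's database schema: the function names and the two toy pages are hypothetical, and lemmatization and wiki-markup parsing, which the described system performs, are deliberately omitted.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into word tokens (no lemmatization)."""
    return re.findall(r"\w+", text.lower())


def build_inverted_index(pages):
    """Map each token to {page title: number of occurrences on that page}."""
    index = defaultdict(lambda: defaultdict(int))
    for title, text in pages.items():
        for token in tokenize(text):
            index[token][title] += 1
    return index


def search(index, term):
    """Return page titles containing the term, most frequent first."""
    postings = index.get(term.lower(), {})
    return sorted(postings, key=lambda title: (-postings[title], title))


# Made-up pages standing in for already-parsed wiki texts.
toy_pages = {
    "Cat": "The cat is a small cat.",
    "Dog": "The dog chases the cat.",
}
toy_index = build_inverted_index(toy_pages)
print(search(toy_index, "cat"))
```

Storing a per-page count with each posting makes frequency-based ranking possible; a relational version would hold the same triples (term, page, count) as rows.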
-
Information filtering based on wiki index database
Authors:
A. V. Smirnov,
A. A. Krizhanovsky
Abstract:
In this paper we present a profile-based approach to information filtering by an analysis of the content of text documents. The Wikipedia index database is created and used to automatically generate the user profile from the user document collection. The problem-oriented Wikipedia subcorpora are created (using knowledge extracted from the user profile) for each topic of user interests. The index databases of these subcorpora are applied to filtering information flow (e.g., mails, news). Thus, the analyzed texts are classified into several topics explicitly presented in the user profile. The paper concentrates on the indexing part of the approach. The architecture of an application implementing the Wikipedia indexing is described. The indexing method is evaluated using the Russian and Simple English Wikipedia.
Submitted 8 May, 2008; v1 submitted 15 April, 2008;
originally announced April 2008.
-
Evaluation experiments on related terms search in Wikipedia: Information Content and Adapted HITS (In Russian)
Authors:
A. A. Krizhanovsky
Abstract:
The classification of metrics and algorithms for searching for related terms via WordNet, Roget's Thesaurus, and Wikipedia was extended to include the adapted HITS algorithm. Evaluation experiments on Information Content and the adapted HITS algorithm are described. A test collection of Russian word pairs with human-assigned similarity judgments is proposed.
-----
Klassifikacija metrik i algoritmov poiska semanticheski blizkih slov v tezaurusah WordNet, Rozhe i jenciklopedii Vikipedija rasshirena adaptirovannym HITS algoritmom. S pomow'ju jeksperimentov v Vikipedii oceneny metrika Information Content i adaptirovannyj algoritm HITS. Predlozhen resurs dlja ocenki semanticheskoj blizosti russkih slov.
Submitted 16 January, 2008; v1 submitted 1 October, 2007;
originally announced October 2007.
-
Context-sensitive access to e-document corpus
Authors:
A. V. Smirnov,
T. V. Levashova,
M. P. Pashkin,
N. G. Shilov,
A. A. Krizhanovsky,
A. M. Kashevnik,
A. S. Komarova
Abstract:
The methodology of context-sensitive access to e-documents considers context as a problem model based on the knowledge extracted from the application domain, and presented in the form of an application ontology. Efficient access to information in text form is needed. Wiki resources, as a modern text format, provide a huge amount of text in a semi-formalized structure. At the first stage of the methodology, documents are indexed against the ontology representing the macro-situation. The indexing method uses a topic tree as a middle layer between documents and the application ontology. At the second stage, documents relevant to the current situation (the abstract and operational contexts) are identified and sorted by degree of relevance. Abstract context is a problem-oriented ontology-based model. Operational context is an instantiation of the abstract context with data provided by the information sources. The following parts of the methodology are described: (i) metrics for measuring similarity of e-documents to the ontology; (ii) a document index storing results of indexing of e-documents against the ontology; (iii) a method for identification of relevant e-documents based on semantic similarity measures. Wikipedia (a wiki resource) is used as a corpus of e-documents for approach evaluation in a case study. Text categorization, the presence of metadata, and the existence of many articles related to different topics characterize the corpus.
Submitted 11 October, 2006;
originally announced October 2006.
-
Automatic forming lists of semantically related terms based on texts rating in the corpus with hyperlinks and categories (In Russian)
Authors:
A. Krizhanovsky
Abstract:
The adapted HITS algorithm for synonym search, the program architecture, and the evaluation of the program on test examples are presented in the paper. The Synarcher program for synonym (and related terms) search in a text corpus of special structure (Wikipedia) was developed. The results of the search are presented in the form of a graph. It is possible to explore the graph and search for graph elements interactively. The proposed algorithm could be applied to search query expansion and to synonym dictionary construction.
Submitted 30 June, 2006;
originally announced June 2006.
-
Synonym search in Wikipedia: Synarcher
Authors:
A. Krizhanovsky
Abstract:
The program Synarcher for synonym (and related terms) search in a text corpus of special structure (Wikipedia) was developed. The results of the search are presented in the form of a graph. It is possible to explore the graph and search for graph elements interactively. The adapted HITS algorithm for synonym search, the program architecture, and the evaluation of the program on test examples are presented in the paper. The proposed algorithm can be applied to query expansion by synonyms (in a search engine) and to synonym dictionary construction.
Submitted 23 June, 2006; v1 submitted 22 June, 2006;
originally announced June 2006.
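For readers unfamiliar with HITS, the sketch below shows the standard hub/authority iteration on a small link graph. It is the plain textbook algorithm, not the adapted variant used in Synarcher, whose modifications are not described in the abstract; the function names, the iteration count, and the three-node graph are assumptions for illustration.

```python
import math


def normalize(scores):
    """Scale scores to unit Euclidean length (left unchanged if all zero)."""
    length = math.sqrt(sum(value * value for value in scores.values()))
    if length == 0.0:
        return dict(scores)
    return {node: value / length for node, value in scores.items()}


def hits(graph, iterations=50):
    """Plain HITS on a dict mapping each page to the pages it links to."""
    nodes = set(graph) | {dst for targets in graph.values() for dst in targets}
    hub = {node: 1.0 for node in nodes}
    auth = {node: 1.0 for node in nodes}
    for _ in range(iterations):
        # Authority score: sum of hub scores of pages linking to the node.
        new_auth = {node: 0.0 for node in nodes}
        for src, targets in graph.items():
            for dst in targets:
                new_auth[dst] += hub[src]
        # Hub score: sum of authority scores of pages the node links to.
        new_hub = {
            node: sum(new_auth[dst] for dst in graph.get(node, ()))
            for node in nodes
        }
        auth = normalize(new_auth)
        hub = normalize(new_hub)
    return hub, auth


# Hypothetical link graph: pages A and B both link to page C.
toy_graph = {"A": ["C"], "B": ["C"], "C": []}
hub_scores, auth_scores = hits(toy_graph)
print(max(auth_scores, key=auth_scores.get))
```

In a hyperlinked corpus such as Wikipedia, pages with high authority scores in the neighborhood of a query page are natural candidates for related terms, which is the general intuition behind using HITS for this kind of search.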
-
Ontology-Based Users & Requests Clustering in Customer Service Management System
Authors:
Alexander Smirnov,
Mikhail Pashkin,
Nikolai Chilov,
Tatiana Levashova,
Andrew Krizhanovsky,
Alexey Kashevnik
Abstract:
Customer Service Management is one of the major business activities aimed at better serving company customers through the introduction of reliable processes and procedures. Today this kind of activity is implemented through e-services to directly involve customers in business processes. Traditionally, Customer Service Management involves the application of data mining techniques to discover usage patterns from the company knowledge memory. Hence, grouping customers and requests into clusters is one of the major techniques for improving the level of company customization. The goal of this paper is to present an approach for clustering users and their requests that is efficient to implement. The approach uses ontology as a knowledge representation model to improve the semantic interoperability between units of the company and customers. Some fragments of the approach tested in an industrial company are also presented in the paper.
Submitted 27 May, 2005; v1 submitted 26 January, 2005;
originally announced January 2005.