Search | arXiv e-print repository

A Language Model for Grammatical Error Correction in L2 Russian

Authors: Nikita Remnev, Sergei Obiedkov, Ekaterina Rakhilina, Ivan Smirnov, Anastasia Vyrenkova

Abstract: Grammatical error correction is one of the fundamental tasks in Natural Language Processing. For the Russian language, most of the spellcheckers available correct typos and other simple errors with high accuracy, but often fail when faced with non-native (L2) writing, since the latter contains errors that are not typical for native speakers. In this paper, we propose a pipeline involving a language model intended for correcting errors in L2 Russian writing. The language model proposed is trained on untagged texts of the Newspaper subcorpus of the Russian National Corpus, and the quality of the model is validated against the RULEC-GEC corpus.

Submitted 4 July, 2023; originally announced July 2023.

arXiv:1911.06861 [pdf, other]

doi 10.1007/978-3-319-26123-2_27

Large-Scale Parallel Matching of Social Network Profiles

Authors: Alexander Panchenko, Dmitry Babaev, Sergei Obiedkov

Abstract: A profile matching algorithm takes as input a user profile of one social network and returns, if existing, the profile of the same person in another social network. Such methods have immediate applications in Internet marketing, search, security, and a number of other domains, which is why this topic saw a recent surge in popularity. In this paper, we present a user identity resolution approach that uses minimal supervision and achieves a precision of 0.98 at a recall of 0.54. Furthermore, the method is computationally efficient and easily parallelizable. We show that the method can be used to match Facebook, the most popular social network globally, with VKontakte, the most popular social network among Russian-speaking users.

Submitted 15 November, 2019; originally announced November 2019.

arXiv:1807.06149 [pdf, other]

doi 10.1016/j.dam.2019.02.036

Probably approximately correct learning of Horn envelopes from queries

Authors: Daniel Borchmann, Tom Hanika, Sergei Obiedkov

Abstract: We propose an algorithm for learning the Horn envelope of an arbitrary domain using an expert, or an oracle, capable of answering certain types of queries about this domain. Attribute exploration from formal concept analysis is a procedure that solves this problem, but the number of queries it may ask is exponential in the size of the resulting Horn formula in the worst case. We recall a well-known polynomial-time algorithm for learning Horn formulas with membership and equivalence queries and modify it to obtain a polynomial-time probably approximately correct algorithm for learning the Horn envelope of an arbitrary domain.

Submitted 16 July, 2018; originally announced July 2018.

Comments: 21 pages, 1 figure

MSC Class: 03G10 68T27 ACM Class: F.4.1; I.2.6

Journal ref: Discrete Applied Mathematics Volume 273 (2020), Pages 30-42

arXiv:1804.05831 [pdf]

Neologisms on Facebook

Authors: Nikita Muravyev, Alexander Panchenko, Sergei Obiedkov

Abstract: In this paper, we present a study of neologisms and loan words frequently occurring in Facebook user posts. We have analyzed a dataset of several million publicly available posts written during 2006-2013 by Russian-speaking Facebook users. From these, we have built a vocabulary of most frequent lemmatized words missing from the OpenCorpora dictionary, the assumption being that many such words have entered common use only recently. This assumption is certainly not true for all the words extracted in this way; for that reason, we manually filtered the automatically obtained list in order to exclude non-Russian or incorrectly lemmatized words, as well as words recorded by other dictionaries or those occurring in texts from the Russian National Corpus. The result is a list of 168 words that can potentially be considered neologisms. We present an attempt at an etymological classification of these neologisms (unsurprisingly, most of them have recently been borrowed from English, but there are also quite a few new words composed of previously borrowed stems) and identify various derivational patterns. We also classify words into several large thematic areas, "internet", "marketing", and "multimedia" being among those with the largest number of words. We believe that, together with the word base collected in the process, they can serve as a starting point in further studies of neologisms and lexical processes that lead to their acceptance into the mainstream language.

Submitted 13 April, 2018; originally announced April 2018.

Comments: in Russian

arXiv:1701.00877 [pdf, other]

doi 10.1007/978-3-319-59271-8_5

On the Usability of Probably Approximately Correct Implication Bases

Authors: Daniel Borchmann, Tom Hanika, Sergei Obiedkov

Abstract: We revisit the notion of probably approximately correct implication bases from the literature and present a first formulation in the language of formal concept analysis, with the goal to investigate whether such bases represent a suitable substitute for exact implication bases in practical use-cases. To this end, we quantitatively examine the behavior of probably approximately correct implication bases on artificial and real-world data sets and compare their precision and recall with respect to their corresponding exact implication bases. Using a small example, we also provide qualitative insight that implications from probably approximately correct bases can still represent meaningful knowledge from a given data set.

Submitted 18 January, 2017; v1 submitted 3 January, 2017; originally announced January 2017.

Comments: 17 pages, 8 figures; typos added, corrected x-label on graphs

MSC Class: 03G10 68T27 ACM Class: F.4.1; I.2.6

Showing 1–5 of 5 results for author: Obiedkov, S