Showing 1–28 of 28 results for author: Yamshchikov, I P

  1. arXiv:2404.03437

    cs.CL cs.SI

    Knowledge Graph Representation for Political Information Sources

    Authors: Tinatin Osmonova, Alexey Tikhonov, Ivan P. Yamshchikov

    Abstract: With the rise of computational social science, many scholars utilize data analysis and natural language processing tools to analyze social media, news articles, and other accessible data sources for examining political and social discourse. Particularly, the study of the emergence of echo-chambers due to the dissemination of specific information has become a topic of interest in mixed methods rese…

    Submitted 4 April, 2024; originally announced April 2024.

  2. arXiv:2403.19423

    cs.SI cs.CL

    Echo-chambers and Idea Labs: Communication Styles on Twitter

    Authors: Aleksandra Sorokovikova, Michael Becker, Ivan P. Yamshchikov

    Abstract: This paper investigates the communication styles and structures of Twitter (X) communities within the vaccination context. While mainstream research primarily focuses on the echo-chamber phenomenon, wherein certain ideas are reinforced and participants are isolated from opposing opinions, this study reveals the presence of diverse communication styles across various communities. In addition to the…

    Submitted 28 March, 2024; originally announced March 2024.

    ACM Class: J.4; K.4.1; K.4.2

  3. arXiv:2402.14890

    cs.CL cs.AI cs.LG

    Vygotsky Distance: Measure for Benchmark Task Similarity

    Authors: Maxim K. Surkov, Ivan P. Yamshchikov

    Abstract: Evaluation plays a significant role in modern natural language processing. Most modern NLP benchmarks consist of arbitrary sets of tasks that neither guarantee any generalization potential for the model once applied outside the test set nor try to minimize the resource consumption needed for model evaluation. This paper presents a theoretical instrument and a practical algorithm to calculate simil…

    Submitted 26 February, 2024; v1 submitted 22 February, 2024; originally announced February 2024.

    MSC Class: 68T01; 97P80; 97C30; 68Q32 ACM Class: H.1.1; I.2.4; I.2.6; F.2.0

  4. arXiv:2402.01765

    cs.CL cs.AI

    LLMs Simulate Big Five Personality Traits: Further Evidence

    Authors: Aleksandra Sorokovikova, Natalia Fedorova, Sharwin Rezagholi, Ivan P. Yamshchikov

    Abstract: An empirical investigation into the simulation of the Big Five personality traits by large language models (LLMs), namely Llama2, GPT4, and Mixtral, is presented. We analyze the personality traits simulated by these models and their stability. This contributes to the broader understanding of the capabilities of LLMs to simulate personality traits and the respective implications for personalized hu…

    Submitted 31 January, 2024; originally announced February 2024.

    ACM Class: I.2.7; J.4; I.2.1

  5. arXiv:2401.17827

    cs.CL cs.AI

    Neural Machine Translation for Malayalam Paraphrase Generation

    Authors: Christeena Varghese, Sergey Koshelev, Ivan P. Yamshchikov

    Abstract: This study explores four methods of generating paraphrases in Malayalam, utilizing resources available for English paraphrasing and pre-trained Neural Machine Translation (NMT) models. We evaluate the resulting paraphrases using both automated metrics, such as BLEU, METEOR, and cosine similarity, as well as human annotation. Our findings suggest that automated evaluation measures may not be fully…

    Submitted 31 January, 2024; originally announced January 2024.

    ACM Class: I.7.0; I.2.7

  6. arXiv:2311.02049

    cs.CL cs.AI

    Post Turing: Mapping the landscape of LLM Evaluation

    Authors: Alexey Tikhonov, Ivan P. Yamshchikov

    Abstract: In the rapidly evolving landscape of Large Language Models (LLMs), introduction of well-defined and standardized evaluation methodologies remains a crucial challenge. This paper traces the historical trajectory of LLM evaluations, from the foundational questions posed by Alan Turing to the modern era of AI research. We categorize the evolution of LLMs into distinct periods, each characterized by i…

    Submitted 3 November, 2023; originally announced November 2023.

    Comments: Accepted for GEM @ EMNLP 2023

    MSC Class: 68T50 ACM Class: I.2.7

  7. Rehabilitating Homeless: Dataset and Key Insights

    Authors: Anna Bykova, Nikolay Filippov, Ivan P. Yamshchikov

    Abstract: This paper presents a large anonymized dataset of homelessness alongside insights into the data-driven rehabilitation of homeless people. The dataset was gathered by a large nonprofit organization working on rehabilitating the homeless for twenty years. This is the first dataset that we know of that contains rich information on thousands of homeless individuals seeking rehabilitation. We show how…

    Submitted 10 February, 2023; v1 submitted 9 February, 2023; originally announced February 2023.

    Comments: Dataset, code and appendix to this article are available at https://github.com/LEYADEV/homeless

  8. arXiv:2211.11041

    cs.CL cs.AI cs.IT

    Pragmatic Constraint on Distributional Semantics

    Authors: Elizaveta Zhemchuzhina, Nikolai Filippov, Ivan P. Yamshchikov

    Abstract: This paper studies the limits of language models' statistical learning in the context of Zipf's law. First, we demonstrate that Zipf-law token distribution emerges irrespective of the chosen tokenization. Second, we show that Zipf distribution is characterized by two distinct groups of tokens that differ both in terms of their frequency and their semantics. Namely, the tokens that have a one-to-on…

    Submitted 20 November, 2022; originally announced November 2022.

    ACM Class: E.4; H.1.1; I.2.7

  9. arXiv:2211.05673

    cs.CL cs.AI cs.CY cs.LG

    BERT in Plutarch's Shadows

    Authors: Ivan P. Yamshchikov, Alexey Tikhonov, Yorgos Pantis, Charlotte Schubert, Jürgen Jost

    Abstract: The extensive surviving corpus of the ancient scholar Plutarch of Chaeronea (ca. 45-120 CE) also contains several texts which, according to current scholarly opinion, did not originate with him and are therefore attributed to an anonymous author Pseudo-Plutarch. These include, in particular, the work Placita Philosophorum (Quotations and Opinions of the Ancient Philosophers), which is extremely im…

    Submitted 10 November, 2022; originally announced November 2022.

    MSC Class: 68T50 ACM Class: I.2.7; J.5

  10. What is Wrong with Language Models that Can Not Tell a Story?

    Authors: Ivan P. Yamshchikov, Alexey Tikhonov

    Abstract: This paper argues that a deeper understanding of narrative and the successful generation of longer subjectively interesting texts is a vital bottleneck that hinders the progress in modern Natural Language Processing (NLP) and may even be in the whole field of Artificial Intelligence. We demonstrate that there are no adequate datasets, evaluation methods, and even operational concepts that could be…

    Submitted 10 November, 2022; v1 submitted 9 November, 2022; originally announced November 2022.

    MSC Class: 68T50 ACM Class: I.2.7; J.5

  11. arXiv:2208.02554

    cs.CL

    Vocabulary Transfer for Medical Texts

    Authors: Vladislav D. Mosin, Ivan P. Yamshchikov

    Abstract: Vocabulary transfer is a transfer learning subtask in which language models fine-tune with the corpus-specific tokenization instead of the default one, which is being used during pretraining. This usually improves the resulting performance of the model, and in the paper, we demonstrate that vocabulary transfer is especially beneficial for medical text processing. Using three different medical natu…

    Submitted 4 August, 2022; originally announced August 2022.

  12. arXiv:2202.03119

    cs.CL

    Moving Other Way: Exploring Word Mover Distance Extensions

    Authors: Ilya Smirnov, Ivan P. Yamshchikov

    Abstract: The word mover's distance (WMD) is a popular semantic similarity metric for two texts. This position paper studies several possible extensions of WMD. We experiment with the frequency of words in the corpus as a weighting factor and the geometry of the word vector space. We validate possible extensions of WMD on six document classification datasets. Some proposed extensions show better results in…

    Submitted 8 February, 2022; v1 submitted 7 February, 2022; originally announced February 2022.

    MSC Class: 49Q22 ACM Class: I.2.7

  13. Fine-Tuning Transformers: Vocabulary Transfer

    Authors: Vladislav Mosin, Igor Samenko, Alexey Tikhonov, Borislav Kozlovskii, Ivan P. Yamshchikov

    Abstract: Transformers are responsible for the vast majority of recent advances in natural language processing. The majority of practical natural language processing applications of these models are typically enabled through transfer learning. This paper studies if corpus-specific tokenization used for fine-tuning improves the resulting performance of the model. Through a series of experiments, we demonstra…

    Submitted 12 December, 2022; v1 submitted 29 December, 2021; originally announced December 2021.

    MSC Class: 68T50; 91F20 ACM Class: I.2.7

  14. Do Data-based Curricula Work?

    Authors: Maxim K. Surkov, Vladislav D. Mosin, Ivan P. Yamshchikov

    Abstract: Current state-of-the-art NLP systems use large neural networks that require lots of computational resources for training. Inspired by human knowledge acquisition, researchers have proposed curriculum learning, - sequencing of tasks (task-based curricula) or ordering and sampling of the datasets (data-based curricula) that facilitate training. This work investigates the benefits of data-based curri…

    Submitted 6 April, 2022; v1 submitted 13 December, 2021; originally announced December 2021.

  15. StoryDB: Broad Multi-language Narrative Dataset

    Authors: Alexey Tikhonov, Igor Samenko, Ivan P. Yamshchikov

    Abstract: This paper presents StoryDB - a broad multi-language dataset of narratives. StoryDB is a corpus of texts that includes stories in 42 different languages. Every language includes 500+ stories. Some of the languages include more than 20 000 stories. Every story is indexed across languages and labeled with tags such as a genre or a topic. The corpus shows rich topical and language variation and can s…

    Submitted 29 September, 2021; originally announced September 2021.

    ACM Class: I.2.7

    Journal ref: In Proceedings of the 2nd Workshop on Evaluation and Comparison of NLP Systems 2021 Nov (pp. 32-39)

  16. arXiv:2109.13855

    cs.CL cs.AI cs.IR

    Actionable Entities Recognition Benchmark for Interactive Fiction

    Authors: Alexey Tikhonov, Ivan P. Yamshchikov

    Abstract: This paper presents a new natural language processing task - Actionable Entities Recognition (AER) - recognition of entities that protagonists could interact with for further plot development. Though similar to classical Named Entity Recognition (NER), it has profound differences. In particular, it is crucial for interactive fiction, where the agent needs to detect entities that might be useful in…

    Submitted 16 November, 2022; v1 submitted 28 September, 2021; originally announced September 2021.

    ACM Class: J.5; I.2.6; I.2.7

  17. arXiv:2109.11969

    cs.CL cs.AI cs.HC

    Rethinking Crowd Sourcing for Semantic Similarity

    Authors: Shaul Solomon, Adam Cohn, Hernan Rosenblum, Chezi Hershkovitz, Ivan P. Yamshchikov

    Abstract: Estimation of semantic similarity is crucial for a variety of natural language processing (NLP) tasks. In the absence of a general theory of semantic information, many papers rely on human annotators as the source of ground truth for semantic similarity estimation. This paper investigates the ambiguities inherent in crowd-sourced semantic labeling. It shows that annotators that treat semantic simi…

    Submitted 24 September, 2021; originally announced September 2021.

    ACM Class: I.2.7; H.5.2; K.6.1

  18. DYPLODOC: Dynamic Plots for Document Classification

    Authors: Anastasia Malysheva, Alexey Tikhonov, Ivan P. Yamshchikov

    Abstract: Narrative generation and analysis are still on the fringe of modern natural language processing yet are crucial in a variety of applications. This paper proposes a feature extraction method for plot dynamics. We present a dataset that consists of the plot descriptions for thirteen thousand TV shows alongside meta-information on their genres and dynamic plots extracted from them. We validate the pr…

    Submitted 26 July, 2021; originally announced July 2021.

    ACM Class: I.2.7; I.2.6

    Journal ref: in Modern Management based on Big Data II and Machine Learning and Intelligent Systems III 2021 (pp. 511-519). IOS Press

  19. arXiv:2007.06290

    cs.CL cs.AI cs.CY

    Paranoid Transformer: Reading Narrative of Madness as Computational Approach to Creativity

    Authors: Yana Agafonova, Alexey Tikhonov, Ivan P. Yamshchikov

    Abstract: This paper revisits the receptive theory in the context of computational creativity. It presents a case study of a Paranoid Transformer - a fully autonomous text generation engine with raw output that could be read as the narrative of a mad digital persona without any additional human post-filtering. We describe technical details of the generative system, provide examples of output and discuss the im…

    Submitted 13 July, 2020; originally announced July 2020.

    MSC Class: 68T50; 68T07; 91F20; 68T42 ACM Class: H.1.2; J.5; K.4

    Journal ref: Future Internet. 2020 Nov;12(11):182

  20. arXiv:2007.06284

    eess.AS cs.LG cs.SD

    Artificial Neural Networks Jamming on the Beat

    Authors: Alexey Tikhonov, Ivan P. Yamshchikov

    Abstract: This paper addresses the issue of long-scale correlations that is characteristic for symbolic music and is a challenge for modern generative algorithms. It suggests a very simple workaround for this challenge, namely, generation of a drum pattern that could be further used as a foundation for melody generation. The paper presents a large dataset of drum patterns alongside with corresponding melodi…

    Submitted 20 May, 2021; v1 submitted 13 July, 2020; originally announced July 2020.

    MSC Class: 68T07; 68T50 ACM Class: J.5; E.0

  21. arXiv:2004.12835

    cs.CL cs.IR cs.LG

    Intuitive Contrasting Map for Antonym Embeddings

    Authors: Igor Samenko, Alexey Tikhonov, Ivan P. Yamshchikov

    Abstract: This paper shows that modern word embeddings contain information that distinguishes synonyms and antonyms despite small cosine similarities between corresponding vectors. This information is encoded in the geometry of the embeddings and could be extracted with a straightforward and intuitive manifold learning procedure or a contrasting map. Such a map is trained on a small labeled subset of the…

    Submitted 7 September, 2021; v1 submitted 27 April, 2020; originally announced April 2020.

    MSC Class: 68T50; 68T35 ACM Class: I.2.7; E.4

    Journal ref: In Modern Management based on Big Data II and Machine Learning and Intelligent Systems III 2021 (pp. 502-510). IOS Press

  22. Style-transfer and Paraphrase: Looking for a Sensible Semantic Similarity Metric

    Authors: Ivan P. Yamshchikov, Viacheslav Shibaev, Nikolay Khlebnikov, Alexey Tikhonov

    Abstract: The rapid development of such natural language processing tasks as style transfer, paraphrase, and machine translation often calls for the use of semantic similarity metrics. In recent years a lot of methods to measure the semantic similarity of two short texts were developed. This paper provides a comprehensive analysis for more than a dozen of such methods. Using a new dataset of fourteen thousa…

    Submitted 3 December, 2020; v1 submitted 10 April, 2020; originally announced April 2020.

    MSC Class: 68Q55 ACM Class: H.1.1; E.4

  23. arXiv:2003.05758

    cs.CL math.MG

    It Means More if It Sounds Good: Yet Another Hypothesis Concerning the Evolution of Polysemous Words

    Authors: Ivan P. Yamshchikov, Cyrille Merleau Nono Saha, Igor Samenko, Jürgen Jost

    Abstract: This position paper looks into the formation of language and shows ties between structural properties of the words in the English language and their polysemy. Using Ollivier-Ricci curvature over a large graph of synonyms to estimate polysemy it shows empirically that the words that arguably are easier to pronounce also tend to have multiple meanings.

    Submitted 7 January, 2021; v1 submitted 12 March, 2020; originally announced March 2020.

    MSC Class: 68U15; 68R10 ACM Class: G.2.2; J.5

  24. Decomposing Textual Information For Style Transfer

    Authors: Ivan P. Yamshchikov, Viacheslav Shibaev, Aleksander Nagaev, Jürgen Jost, Alexey Tikhonov

    Abstract: This paper focuses on latent representations that could effectively decompose different aspects of textual information. Using a framework of style transfer for texts, we propose several empirical methods to assess information decomposition quality. We validate these methods with several state-of-the-art textual style transfer methods. Higher quality of information decomposition corresponds to high…

    Submitted 26 September, 2019; originally announced September 2019.

    Comments: arXiv admin note: substantial text overlap with arXiv:1908.06809

    Journal ref: EMNLP-IJCNLP 2019. 2019 Nov 4:128

  25. Style Transfer for Texts: Retrain, Report Errors, Compare with Rewrites

    Authors: Alexey Tikhonov, Viacheslav Shibaev, Aleksander Nagaev, Aigul Nugmanova, Ivan P. Yamshchikov

    Abstract: This paper shows that standard assessment methodology for style transfer has several significant problems. First, the standard metrics for style accuracy and semantics preservation vary significantly on different re-runs. Therefore one has to report error margins for the obtained results. Second, starting with certain values of bilingual evaluation understudy (BLEU) between input and output and ac…

    Submitted 29 August, 2019; v1 submitted 19 August, 2019; originally announced August 2019.

    Journal ref: In Proceedings of EMNLP-IJCNLP 2019 Nov (pp. 3936-3945)

  26. arXiv:1808.04365

    cs.CL cs.AI

    What is wrong with style transfer for texts?

    Authors: Alexey Tikhonov, Ivan P. Yamshchikov

    Abstract: A number of recent machine learning papers work with an automated style transfer for texts and, counter to intuition, demonstrate that there is no consensus formulation of this NLP task. Different researchers propose different algorithms, datasets and target metrics to address it. This short opinion paper aims to discuss possible formalization of this NLP task in anticipation of a further growing…

    Submitted 13 August, 2018; originally announced August 2018.

  27. arXiv:1807.07147

    cs.CL cs.AI cs.LG

    Guess who? Multilingual approach for the automated generation of author-stylized poetry

    Authors: Alexey Tikhonov, Ivan P. Yamshchikov

    Abstract: This paper addresses the problem of stylized text generation in a multilingual setup. A version of a language model based on a long short-term memory (LSTM) artificial neural network with extended phonetic and semantic embeddings is used for stylized poetry generation. The quality of the resulting poems generated by the network is estimated through bilingual evaluation understudy (BLEU), a survey…

    Submitted 17 September, 2018; v1 submitted 17 July, 2018; originally announced July 2018.

  28. Music generation with variational recurrent autoencoder supported by history

    Authors: Ivan P. Yamshchikov, Alexey Tikhonov

    Abstract: A new architecture of an artificial neural network that helps to generate longer melodic patterns is introduced alongside with methods for post-generation filtering. The proposed approach called variational autoencoder supported by history is based on a recurrent highway gated network combined with a variational autoencoder. Combination of this architecture with filtering heuristics allows generat…

    Submitted 12 November, 2018; v1 submitted 15 May, 2017; originally announced May 2017.