Knowledge Graph Representation for Political Information Sources

Authors: Tinatin Osmonova, Alexey Tikhonov, Ivan P. Yamshchikov

Abstract: With the rise of computational social science, many scholars utilize data analysis and natural language processing tools to analyze social media, news articles, and other accessible data sources for examining political and social discourse. Particularly, the study of the emergence of echo-chambers due to the dissemination of specific information has become a topic of interest in mixed methods research areas. In this paper, we analyze data collected from two news portals, Breitbart News (BN) and New York Times (NYT), to prove the hypothesis that the formation of echo-chambers can be partially explained on the level of individual information consumption rather than a collective topology of individuals' social networks. Our research findings are presented through knowledge graphs, utilizing a dataset spanning 11.5 years gathered from BN and NYT media portals. We demonstrate that the application of knowledge representation techniques to the aforementioned news streams highlights, contrary to common assumptions, the relative "internal" neutrality of both sources and a polarizing attitude towards a small fraction of entities. Additionally, we argue that such characteristics in information sources lead to fundamental disparities in audience worldviews, potentially acting as a catalyst for the formation of echo-chambers.

Submitted 4 April, 2024; originally announced April 2024.

arXiv:2403.19423 [pdf, other]

Echo-chambers and Idea Labs: Communication Styles on Twitter

Authors: Aleksandra Sorokovikova, Michael Becker, Ivan P. Yamshchikov

Abstract: This paper investigates the communication styles and structures of Twitter (X) communities within the vaccination context. While mainstream research primarily focuses on the echo-chamber phenomenon, wherein certain ideas are reinforced and participants are isolated from opposing opinions, this study reveals the presence of diverse communication styles across various communities. In addition to the communities exhibiting echo-chamber behavior, this research uncovers communities with distinct communication patterns. By shedding light on the nuanced nature of communication within social networks, this study emphasizes the significance of understanding the diversity of perspectives within online communities.

Submitted 28 March, 2024; originally announced March 2024.

ACM Class: J.4; K.4.1; K.4.2

arXiv:2402.14890 [pdf, other]

Vygotsky Distance: Measure for Benchmark Task Similarity

Authors: Maxim K. Surkov, Ivan P. Yamshchikov

Abstract: Evaluation plays a significant role in modern natural language processing. Most modern NLP benchmarks consist of arbitrary sets of tasks that neither guarantee any generalization potential for the model once applied outside the test set nor try to minimize the resource consumption needed for model evaluation. This paper presents a theoretical instrument and a practical algorithm to calculate similarity between benchmark tasks; we call this similarity measure "Vygotsky distance". The core idea of this similarity measure is that it is based on relative performance of the "students" on a given task, rather than on the properties of the task itself. If two tasks are close to each other in terms of Vygotsky distance, the models tend to have similar relative performance on them. Thus, knowing Vygotsky distance between tasks, one can significantly reduce the number of evaluation tasks while maintaining a high validation quality. Experiments on various benchmarks, including GLUE, SuperGLUE, CLUE, and RussianSuperGLUE, demonstrate that a vast majority of NLP benchmarks could be at least 40% smaller in terms of the tasks included. Most importantly, Vygotsky distance could also be used for the validation of new tasks, thus increasing the generalization potential of the future NLP models.

Submitted 26 February, 2024; v1 submitted 22 February, 2024; originally announced February 2024.

MSC Class: 68T01; 97P80; 97C30; 68Q32 ACM Class: H.1.1; I.2.4; I.2.6; F.2.0

arXiv:2402.01765 [pdf, other]

LLMs Simulate Big Five Personality Traits: Further Evidence

Authors: Aleksandra Sorokovikova, Natalia Fedorova, Sharwin Rezagholi, Ivan P. Yamshchikov

Abstract: An empirical investigation into the simulation of the Big Five personality traits by large language models (LLMs), namely Llama2, GPT4, and Mixtral, is presented. We analyze the personality traits simulated by these models and their stability. This contributes to the broader understanding of the capabilities of LLMs to simulate personality traits and the respective implications for personalized human-computer interaction.

Submitted 31 January, 2024; originally announced February 2024.

ACM Class: I.2.7; J.4; I.2.1

arXiv:2401.17827 [pdf, other]

Neural Machine Translation for Malayalam Paraphrase Generation

Authors: Christeena Varghese, Sergey Koshelev, Ivan P. Yamshchikov

Abstract: This study explores four methods of generating paraphrases in Malayalam, utilizing resources available for English paraphrasing and pre-trained Neural Machine Translation (NMT) models. We evaluate the resulting paraphrases using both automated metrics, such as BLEU, METEOR, and cosine similarity, as well as human annotation. Our findings suggest that automated evaluation measures may not be fully appropriate for Malayalam, as they do not consistently align with human judgment. This discrepancy underscores the need for more nuanced paraphrase evaluation approaches especially for highly agglutinative languages.

Submitted 31 January, 2024; originally announced January 2024.

ACM Class: I.7.0; I.2.7

arXiv:2311.02049 [pdf, other]

Post Turing: Mapping the landscape of LLM Evaluation

Authors: Alexey Tikhonov, Ivan P. Yamshchikov

Abstract: In the rapidly evolving landscape of Large Language Models (LLMs), introduction of well-defined and standardized evaluation methodologies remains a crucial challenge. This paper traces the historical trajectory of LLM evaluations, from the foundational questions posed by Alan Turing to the modern era of AI research. We categorize the evolution of LLMs into distinct periods, each characterized by its unique benchmarks and evaluation criteria. As LLMs increasingly mimic human-like behaviors, traditional evaluation proxies, such as the Turing test, have become less reliable. We emphasize the pressing need for a unified evaluation system, given the broader societal implications of these models. Through an analysis of common evaluation methodologies, we advocate for a qualitative shift in assessment approaches, underscoring the importance of standardization and objective criteria. This work serves as a call for the AI community to collaboratively address the challenges of LLM evaluation, ensuring their reliability, fairness, and societal benefit.

Submitted 3 November, 2023; originally announced November 2023.

Comments: Accepted for GEM @ EMNLP 2023

MSC Class: 68T50 ACM Class: I.2.7

arXiv:2302.04455 [pdf, other]

doi 10.1609/aaai.v37i12.26654

Rehabilitating Homeless: Dataset and Key Insights

Authors: Anna Bykova, Nikolay Filippov, Ivan P. Yamshchikov

Abstract: This paper presents a large anonymized dataset of homelessness alongside insights into the data-driven rehabilitation of homeless people. The dataset was gathered by a large nonprofit organization working on rehabilitating the homeless for twenty years. This is the first dataset that we know of that contains rich information on thousands of homeless individuals seeking rehabilitation. We show how data analysis can help to make the rehabilitation of homeless people more effective and successful. Thus, we hope this paper alerts the data science community to the problem of homelessness.

Submitted 10 February, 2023; v1 submitted 9 February, 2023; originally announced February 2023.

Comments: Dataset, code and appendix to this article are available at https://github.com/LEYADEV/homeless

arXiv:2211.11041 [pdf, other]

Pragmatic Constraint on Distributional Semantics

Authors: Elizaveta Zhemchuzhina, Nikolai Filippov, Ivan P. Yamshchikov

Abstract: This paper studies the limits of language models' statistical learning in the context of Zipf's law. First, we demonstrate that Zipf-law token distribution emerges irrespective of the chosen tokenization. Second, we show that Zipf distribution is characterized by two distinct groups of tokens that differ both in terms of their frequency and their semantics. Namely, the tokens that have a one-to-one correspondence with one semantic concept have different statistical properties than those with semantic ambiguity. Finally, we demonstrate how these properties interfere with statistical learning procedures motivated by distributional semantics.

Submitted 20 November, 2022; originally announced November 2022.

ACM Class: E.4; H.1.1; I.2.7

arXiv:2211.05673 [pdf, other]

doi 10.18653/v1/2022.emnlp-main.407

BERT in Plutarch's Shadows

Authors: Ivan P. Yamshchikov, Alexey Tikhonov, Yorgos Pantis, Charlotte Schubert, Jürgen Jost

Abstract: The extensive surviving corpus of the ancient scholar Plutarch of Chaeronea (ca. 45-120 CE) also contains several texts which, according to current scholarly opinion, did not originate with him and are therefore attributed to an anonymous author Pseudo-Plutarch. These include, in particular, the work Placita Philosophorum (Quotations and Opinions of the Ancient Philosophers), which is extremely important for the history of ancient philosophy. Little is known about the identity of that anonymous author and its relation to other authors from the same period. This paper presents a BERT language model for Ancient Greek. The model discovers previously unknown statistical properties relevant to these literary, philosophical, and historical problems and can shed new light on this authorship question. In particular, the Placita Philosophorum, together with one of the other Pseudo-Plutarch texts, shows similarities with the texts written by authors from an Alexandrian context (2nd/3rd century CE).

Submitted 10 November, 2022; originally announced November 2022.

MSC Class: 68T50 ACM Class: I.2.7; J.5

arXiv:2211.05044 [pdf, other]

doi 10.18653/v1/2023.wnu-1.8

What is Wrong with Language Models that Can Not Tell a Story?

Authors: Ivan P. Yamshchikov, Alexey Tikhonov

Abstract: This paper argues that a deeper understanding of narrative and the successful generation of longer subjectively interesting texts is a vital bottleneck that hinders the progress in modern Natural Language Processing (NLP) and possibly in the whole field of Artificial Intelligence. We demonstrate that there are no adequate datasets, evaluation methods, or even operational concepts that could be used to start working on narrative processing.

Submitted 10 November, 2022; v1 submitted 9 November, 2022; originally announced November 2022.

MSC Class: 68T50 ACM Class: I.2.7; J.5

arXiv:2208.02554 [pdf, other]

Vocabulary Transfer for Medical Texts

Authors: Vladislav D. Mosin, Ivan P. Yamshchikov

Abstract: Vocabulary transfer is a transfer learning subtask in which language models are fine-tuned with corpus-specific tokenization instead of the default one used during pretraining. This usually improves the resulting performance of the model, and in the paper, we demonstrate that vocabulary transfer is especially beneficial for medical text processing. Using three different medical natural language processing datasets, we show that vocabulary transfer provides up to ten extra percentage points of downstream classifier accuracy.

Submitted 4 August, 2022; originally announced August 2022.

arXiv:2202.03119 [pdf, ps, other]

Moving Other Way: Exploring Word Mover Distance Extensions

Authors: Ilya Smirnov, Ivan P. Yamshchikov

Abstract: The word mover's distance (WMD) is a popular semantic similarity metric for two texts. This position paper studies several possible extensions of WMD. We experiment with the frequency of words in the corpus as a weighting factor and the geometry of the word vector space. We validate possible extensions of WMD on six document classification datasets. Some proposed extensions show better results in terms of the k-nearest neighbor classification error than WMD.

Submitted 8 February, 2022; v1 submitted 7 February, 2022; originally announced February 2022.

MSC Class: 49Q22 ACM Class: I.2.7

arXiv:2112.14569 [pdf, other]

doi 10.1016/j.artint.2023.103860

Fine-Tuning Transformers: Vocabulary Transfer

Authors: Vladislav Mosin, Igor Samenko, Alexey Tikhonov, Borislav Kozlovskii, Ivan P. Yamshchikov

Abstract: Transformers are responsible for the vast majority of recent advances in natural language processing. The majority of practical natural language processing applications of these models are typically enabled through transfer learning. This paper studies if corpus-specific tokenization used for fine-tuning improves the resulting performance of the model. Through a series of experiments, we demonstrate that such tokenization combined with the initialization and fine-tuning strategy for the vocabulary tokens speeds up the transfer and boosts the performance of the fine-tuned model. We call this aspect of transfer facilitation vocabulary transfer.

Submitted 12 December, 2022; v1 submitted 29 December, 2021; originally announced December 2021.

MSC Class: 68T50; 91F20 ACM Class: I.2.7

arXiv:2112.06510 [pdf, other]

doi 10.18653/v1/2022.insights-1.16

Do Data-based Curricula Work?

Authors: Maxim K. Surkov, Vladislav D. Mosin, Ivan P. Yamshchikov

Abstract: Current state-of-the-art NLP systems use large neural networks that require lots of computational resources for training. Inspired by human knowledge acquisition, researchers have proposed curriculum learning: sequencing of tasks (task-based curricula) or ordering and sampling of the datasets (data-based curricula) that facilitate training. This work investigates the benefits of data-based curriculum learning for large modern language models such as BERT and T5. We experiment with various curricula based on a range of complexity measures and different sampling strategies. Extensive experiments on different NLP tasks show that curricula based on various complexity measures rarely have any benefits, while random sampling performs either as well as or better than curricula.

Submitted 6 April, 2022; v1 submitted 13 December, 2021; originally announced December 2021.

arXiv:2109.14396 [pdf, other]

doi 10.18653/v1/2021.eval4nlp-1.4

StoryDB: Broad Multi-language Narrative Dataset

Authors: Alexey Tikhonov, Igor Samenko, Ivan P. Yamshchikov

Abstract: This paper presents StoryDB - a broad multi-language dataset of narratives. StoryDB is a corpus of texts that includes stories in 42 different languages. Every language includes 500+ stories. Some of the languages include more than 20 000 stories. Every story is indexed across languages and labeled with tags such as a genre or a topic. The corpus shows rich topical and language variation and can serve as a resource for the study of the role of narrative in natural language processing across various languages including low resource ones. We also demonstrate how the dataset could be used to benchmark three modern multilanguage models, namely, mDistillBERT, mBERT, and XLM-RoBERTa.

Submitted 29 September, 2021; originally announced September 2021.

ACM Class: I.2.7

Journal ref: In Proceedings of the 2nd Workshop on Evaluation and Comparison of NLP Systems 2021 Nov (pp. 32-39)

arXiv:2109.13855 [pdf, other]

Actionable Entities Recognition Benchmark for Interactive Fiction

Authors: Alexey Tikhonov, Ivan P. Yamshchikov

Abstract: This paper presents a new natural language processing task - Actionable Entities Recognition (AER) - recognition of entities that protagonists could interact with for further plot development. Though similar to classical Named Entity Recognition (NER), it has profound differences. In particular, it is crucial for interactive fiction, where the agent needs to detect entities that might be useful in the future. We also discuss if AER might be further helpful for the systems dealing with narrative processing since actionable entities profoundly impact the causal relationship in a story. We validate the proposed task on two previously available datasets and present a new benchmark dataset for the AER task that includes 5550 descriptions with one or more actionable entities.

Submitted 16 November, 2022; v1 submitted 28 September, 2021; originally announced September 2021.

ACM Class: J.5; I.2.6; I.2.7

arXiv:2109.11969 [pdf, other]

Rethinking Crowd Sourcing for Semantic Similarity

Authors: Shaul Solomon, Adam Cohn, Hernan Rosenblum, Chezi Hershkovitz, Ivan P. Yamshchikov

Abstract: Estimation of semantic similarity is crucial for a variety of natural language processing (NLP) tasks. In the absence of a general theory of semantic information, many papers rely on human annotators as the source of ground truth for semantic similarity estimation. This paper investigates the ambiguities inherent in crowd-sourced semantic labeling. It shows that annotators that treat semantic similarity as a binary category (two sentences are either similar or not similar and there is no middle ground) play the most important role in the labeling. The paper offers heuristics to filter out unreliable annotators and stimulates further discussions on human perception of semantic similarity.

Submitted 24 September, 2021; originally announced September 2021.

ACM Class: I.2.7; H.5.2; K.6.1

arXiv:2107.12226 [pdf, other]

doi 10.3233/FAIA210283

DYPLODOC: Dynamic Plots for Document Classification

Authors: Anastasia Malysheva, Alexey Tikhonov, Ivan P. Yamshchikov

Abstract: Narrative generation and analysis are still on the fringe of modern natural language processing yet are crucial in a variety of applications. This paper proposes a feature extraction method for plot dynamics. We present a dataset that consists of the plot descriptions for thirteen thousand TV shows alongside meta-information on their genres and dynamic plots extracted from them. We validate the proposed tool for plot dynamics extraction and discuss possible applications of this method to the tasks of narrative analysis and generation.

Submitted 26 July, 2021; originally announced July 2021.

ACM Class: I.2.7; I.2.6

Journal ref: in Modern Management based on Big Data II and Machine Learning and Intelligent Systems III 2021 (pp. 511-519). IOS Press

arXiv:2007.06290 [pdf, other]

doi 10.3390/fi12110182

Paranoid Transformer: Reading Narrative of Madness as Computational Approach to Creativity

Authors: Yana Agafonova, Alexey Tikhonov, Ivan P. Yamshchikov

Abstract: This paper revisits the receptive theory in the context of computational creativity. It presents a case study of a Paranoid Transformer - a fully autonomous text generation engine with raw output that could be read as the narrative of a mad digital persona without any additional human post-filtering. We describe technical details of the generative system, provide examples of output, and discuss the impact of receptive theory, chance discovery, and simulation of fringe mental state on the understanding of computational creativity.

Submitted 13 July, 2020; originally announced July 2020.

MSC Class: 68T50; 68T07; 91F20; 68T42 ACM Class: H.1.2; J.5; K.4

Journal ref: Future Internet. 2020 Nov;12(11):182

arXiv:2007.06284 [pdf, other]

doi 10.5220/0010461200370044

Artificial Neural Networks Jamming on the Beat

Authors: Alexey Tikhonov, Ivan P. Yamshchikov

Abstract: This paper addresses the issue of long-scale correlations that is characteristic of symbolic music and is a challenge for modern generative algorithms. It suggests a very simple workaround for this challenge, namely, generation of a drum pattern that could be further used as a foundation for melody generation. The paper presents a large dataset of drum patterns alongside corresponding melodies. It explores two possible methods for drum pattern generation. Exploring a latent space of drum patterns, one could generate new drum patterns with a given music style. Finally, the paper demonstrates that a simple artificial neural network could be trained to generate melodies corresponding to these drum patterns used as inputs. The resulting system could be used for end-to-end generation of symbolic music with song-like structure and higher long-scale correlations between the notes.

Submitted 20 May, 2021; v1 submitted 13 July, 2020; originally announced July 2020.

MSC Class: 68T07; 68T50 ACM Class: J.5; E.0

arXiv:2004.12835 [pdf, other]

doi 10.3233/FAIA210282

Intuitive Contrasting Map for Antonym Embeddings

Authors: Igor Samenko, Alexey Tikhonov, Ivan P. Yamshchikov

Abstract: This paper shows that modern word embeddings contain information that distinguishes synonyms and antonyms despite small cosine similarities between corresponding vectors. This information is encoded in the geometry of the embeddings and could be extracted with a straightforward and intuitive manifold learning procedure or a contrasting map. Such a map is trained on a small labeled subset of the data and can produce new embeddings that explicitly highlight specific semantic attributes of the word. The new embeddings produced by the map are shown to improve the performance on downstream tasks.

Submitted 7 September, 2021; v1 submitted 27 April, 2020; originally announced April 2020.

MSC Class: 68T50; 68T35 ACM Class: I.2.7; E.4

Journal ref: In Modern Management based on Big Data II and Machine Learning and Intelligent Systems III 2021 (pp. 502-510). IOS Press

arXiv:2004.05001 [pdf, other]

doi 10.1609/aaai.v35i16.17672

Style-transfer and Paraphrase: Looking for a Sensible Semantic Similarity Metric

Authors: Ivan P. Yamshchikov, Viacheslav Shibaev, Nikolay Khlebnikov, Alexey Tikhonov

Abstract: The rapid development of such natural language processing tasks as style transfer, paraphrase, and machine translation often calls for the use of semantic similarity metrics. In recent years, many methods to measure the semantic similarity of two short texts have been developed. This paper provides a comprehensive analysis of more than a dozen such methods. Using a new dataset of fourteen thousand sentence pairs human-labeled according to their semantic similarity, we demonstrate that none of the metrics widely used in the literature is close enough to human judgment in these tasks. A number of recently proposed metrics provide comparable results, yet Word Mover Distance is shown to be the most reasonable solution to measure semantic similarity in reformulated texts at the moment.

Submitted 3 December, 2020; v1 submitted 10 April, 2020; originally announced April 2020.

MSC Class: 68Q55 ACM Class: H.1.1; E.4

arXiv:2003.05758 [pdf, other]

It Means More if It Sounds Good: Yet Another Hypothesis Concerning the Evolution of Polysemous Words

Authors: Ivan P. Yamshchikov, Cyrille Merleau Nono Saha, Igor Samenko, Jürgen Jost

Abstract: This position paper looks into the formation of language and shows ties between structural properties of the words in the English language and their polysemy. Using Ollivier-Ricci curvature over a large graph of synonyms to estimate polysemy, it shows empirically that words that are arguably easier to pronounce also tend to have multiple meanings.

Submitted 7 January, 2021; v1 submitted 12 March, 2020; originally announced March 2020.

MSC Class: 68U15; 68R10 ACM Class: G.2.2; J.5

arXiv:1909.12928 [pdf, other]

doi 10.18653/v1/D19-5613

Decomposing Textual Information For Style Transfer

Authors: Ivan P. Yamshchikov, Viacheslav Shibaev, Aleksander Nagaev, Jürgen Jost, Alexey Tikhonov

Abstract: This paper focuses on latent representations that could effectively decompose different aspects of textual information. Using a framework of style transfer for texts, we propose several empirical methods to assess information decomposition quality. We validate these methods with several state-of-the-art textual style transfer methods. Higher quality of information decomposition corresponds to higher performance in terms of bilingual evaluation understudy (BLEU) between output and human-written reformulations.

Submitted 26 September, 2019; originally announced September 2019.

Comments: arXiv admin note: substantial text overlap with arXiv:1908.06809

Journal ref: EMNLP-IJCNLP 2019. 2019 Nov 4:128

arXiv:1908.06809 [pdf, other]

doi 10.18653/v1/D19-1406

Style Transfer for Texts: Retrain, Report Errors, Compare with Rewrites

Authors: Alexey Tikhonov, Viacheslav Shibaev, Aleksander Nagaev, Aigul Nugmanova, Ivan P. Yamshchikov

Abstract: This paper shows that the standard assessment methodology for style transfer has several significant problems. First, the standard metrics for style accuracy and semantics preservation vary significantly on different re-runs. Therefore one has to report error margins for the obtained results. Second, starting with certain values of bilingual evaluation understudy (BLEU) between input and output and accuracy of the sentiment transfer, the optimization of these two standard metrics diverges from the intuitive goal of the style transfer task. Finally, due to the nature of the task itself, there is a specific dependence between these two metrics that could be easily manipulated. Under these circumstances, we suggest taking BLEU between input and human-written reformulations into consideration for benchmarks. We also propose three new architectures that outperform the state of the art in terms of this metric.

Submitted 29 August, 2019; v1 submitted 19 August, 2019; originally announced August 2019.

Journal ref: In Proceedings of EMNLP-IJCNLP 2019 Nov (pp. 3936-3945)

arXiv:1808.04365 [pdf, ps, other]

What is wrong with style transfer for texts?

Authors: Alexey Tikhonov, Ivan P. Yamshchikov

Abstract: A number of recent machine learning papers work with automated style transfer for texts and, counter to intuition, demonstrate that there is no consensus formulation of this NLP task. Different researchers propose different algorithms, datasets and target metrics to address it. This short opinion paper aims to discuss a possible formalization of this NLP task in anticipation of further growing interest in it.

Submitted 13 August, 2018; originally announced August 2018.

arXiv:1807.07147 [pdf, other]

doi 10.1109/SLT.2018.8639573

Guess who? Multilingual approach for the automated generation of author-stylized poetry

Authors: Alexey Tikhonov, Ivan P. Yamshchikov

Abstract: This paper addresses the problem of stylized text generation in a multilingual setup. A version of a language model based on a long short-term memory (LSTM) artificial neural network with extended phonetic and semantic embeddings is used for stylized poetry generation. The quality of the resulting poems generated by the network is estimated through bilingual evaluation understudy (BLEU), a survey, and a new cross-entropy based metric that is suggested for problems of this type. The experiments show that the proposed model consistently outperforms random sample and vanilla-LSTM baselines; humans also tend to associate machine-generated texts with the target author.

Submitted 17 September, 2018; v1 submitted 17 July, 2018; originally announced July 2018.

arXiv:1705.05458 [pdf, other]

doi 10.1007/s42452-020-03715-w

Music generation with variational recurrent autoencoder supported by history

Authors: Ivan P. Yamshchikov, Alexey Tikhonov

Abstract: A new architecture of an artificial neural network that helps to generate longer melodic patterns is introduced alongside methods for post-generation filtering. The proposed approach, called variational autoencoder supported by history, is based on a recurrent highway gated network combined with a variational autoencoder. Combination of this architecture with filtering heuristics allows generating pseudo-live, acoustically pleasing, and melodically diverse music.

Submitted 12 November, 2018; v1 submitted 15 May, 2017; originally announced May 2017.

Showing 1–28 of 28 results for author: Yamshchikov, I P