-
Exploring language relations through syntactic distances and geographic proximity
Authors:
Juan De Gregorio,
Raúl Toral,
David Sánchez
Abstract:
Languages are grouped into families that share common linguistic traits. While this approach has been successful in understanding genetic relations between diverse languages, more analyses are needed to accurately quantify their relatedness, especially in less studied linguistic levels such as syntax. Here, we explore linguistic distances using series of parts of speech (POS) extracted from the Universal Dependencies dataset. Within an information-theoretic framework, we show that employing POS trigrams maximizes the possibility of capturing syntactic variations while being at the same time compatible with the amount of available data. Linguistic connections are then established by assessing pairwise distances based on the POS distributions. Intriguingly, our analysis reveals definite clusters that correspond to well known language families and groups, with exceptions explained by distinct morphological typologies. Furthermore, we obtain a significant correlation between language similarity and geographic distance, which underscores the influence of spatial proximity on language kinships.
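The trigram-and-distance idea in this abstract can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: the abstract does not name the distance measure, so the Jensen-Shannon distance used here is an assumption, and the two tag sequences are invented toy data rather than Universal Dependencies annotations.

```python
from collections import Counter
from math import log2, sqrt


def trigram_distribution(tags):
    """Relative frequencies of consecutive POS trigrams in a tag sequence."""
    counts = Counter(zip(tags, tags[1:], tags[2:]))
    total = sum(counts.values())
    return {trigram: c / total for trigram, c in counts.items()}


def jensen_shannon_distance(p, q):
    """Square root of the base-2 Jensen-Shannon divergence, a value in [0, 1].

    Assumption: chosen here only as a common distance between distributions.
    """
    divergence = 0.0
    for key in set(p) | set(q):
        pk, qk = p.get(key, 0.0), q.get(key, 0.0)
        mk = 0.5 * (pk + qk)  # mixture probability
        if pk > 0:
            divergence += 0.5 * pk * log2(pk / mk)
        if qk > 0:
            divergence += 0.5 * qk * log2(qk / mk)
    return sqrt(max(divergence, 0.0))


# Hypothetical toy tag sequences, invented for illustration only.
lang_a = ["DET", "NOUN", "VERB", "DET", "NOUN", "ADP", "DET", "NOUN"]
lang_b = ["NOUN", "DET", "VERB", "NOUN", "DET", "ADP", "NOUN", "DET"]

dist_a = trigram_distribution(lang_a)
dist_b = trigram_distribution(lang_b)
distance_ab = jensen_shannon_distance(dist_a, dist_b)
```

The two toy sequences share no trigram, so their distance sits at the upper end of the range, while any distribution compared with itself gives zero.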
Submitted 27 March, 2024;
originally announced March 2024.
-
Entropy estimators for Markovian sequences: A comparative analysis
Authors:
Juan De Gregorio,
David Sanchez,
Raul Toral
Abstract:
Entropy estimation is a fundamental problem in information theory that has applications in various fields, including physics, biology, and computer science. Estimating the entropy of discrete sequences can be challenging due to limited data and the lack of unbiased estimators. Most existing entropy estimators are designed for sequences of independent events, and their performance varies depending on the system being studied and the available data size. In this work we compare different entropy estimators and their performance when applied to Markovian sequences. Specifically, we analyze both binary Markovian sequences and Markovian systems in the undersampled regime. We calculate the bias, standard deviation and mean squared error for some of the most widely employed estimators. We discuss the limitations of entropy estimation as a function of the transition probabilities of the Markov processes and the sample size. Overall, this paper provides a comprehensive comparison of entropy estimators and their performance in estimating entropy for systems with memory, which can be useful for researchers and practitioners in various fields.
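To make the setting concrete, the following sketch simulates a binary Markov chain and applies two standard estimators: the plug-in (maximum-likelihood) estimator and the Miller-Madow bias correction. It is an illustration under assumptions: the abstract does not list which estimators the paper compares, and the function names, transition probabilities, and sample size are hypothetical choices.

```python
import random
from collections import Counter
from math import log


def simulate_binary_markov(n, p01, p10, seed=0):
    """Generate n symbols of a two-state Markov chain.

    p01 is the probability of jumping 0 -> 1, p10 of jumping 1 -> 0.
    """
    rng = random.Random(seed)
    state = 0
    sequence = []
    for _ in range(n):
        sequence.append(state)
        flip_probability = p01 if state == 0 else p10
        if rng.random() < flip_probability:
            state = 1 - state
    return sequence


def plugin_entropy(samples):
    """Plug-in Shannon entropy (in nats) from observed relative frequencies."""
    n = len(samples)
    return -sum((c / n) * log(c / n) for c in Counter(samples).values())


def miller_madow_entropy(samples):
    """Plug-in estimate plus the Miller-Madow correction (K_observed - 1) / (2N)."""
    observed_states = len(set(samples))
    return plugin_entropy(samples) + (observed_states - 1) / (2 * len(samples))


# Hypothetical parameters: a strongly correlated chain and a short sample.
sequence = simulate_binary_markov(n=50, p01=0.1, p10=0.1, seed=1)
estimate_plugin = plugin_entropy(sequence)
estimate_mm = miller_madow_entropy(sequence)
true_entropy = log(2)  # stationary distribution is (1/2, 1/2) when p01 == p10
```

Both estimators assume independent samples, so with small transition probabilities a short sequence can stay in one state for long stretches and the estimates may fall well below `true_entropy`.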
Submitted 17 January, 2024; v1 submitted 11 October, 2023;
originally announced October 2023.
-
Ordinal analysis of lexical patterns
Authors:
David Sanchez,
Luciano Zunino,
Juan De Gregorio,
Raul Toral,
Claudio Mirasso
Abstract:
Words are fundamental linguistic units that connect thoughts and things through meaning. However, words do not appear independently in a text sequence. The existence of syntactic rules induces correlations among neighboring words. Using an ordinal pattern approach, we present an analysis of lexical statistical connections for 11 major languages. We find that the diverse ways in which languages express word relations give rise to unique structural pattern distributions. Furthermore, fluctuations of these pattern distributions for a given language can allow us to determine both the historical period when the text was written and its author. Taken together, our results emphasize the relevance of ordinal time series analysis in linguistic typology, historical linguistics and stylometry.
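An ordinal pattern analysis replaces each short window of a numeric series by the permutation that sorts it, then studies how often each permutation occurs. The sketch below shows that step only. The abstract does not say how words are turned into numbers, so mapping each word to its length is an assumption made purely for illustration, as are the function names and the example sentence.

```python
from collections import Counter


def ordinal_patterns(series, order=3):
    """Return the ordinal pattern of every window of length `order`.

    Each pattern lists the window positions from smallest to largest value.
    Ties are broken by order of appearance, because sorted() is stable.
    """
    patterns = []
    for start in range(len(series) - order + 1):
        window = series[start:start + order]
        pattern = tuple(sorted(range(order), key=lambda j: window[j]))
        patterns.append(pattern)
    return patterns


def pattern_distribution(series, order=3):
    """Relative frequency of each ordinal pattern in the series."""
    counts = Counter(ordinal_patterns(series, order))
    total = sum(counts.values())
    return {pattern: c / total for pattern, c in counts.items()}


# Assumption: words are mapped to their lengths to obtain a numeric series.
text = "the quick brown fox jumps over the lazy dog"
word_lengths = [len(word) for word in text.split()]
distribution = pattern_distribution(word_lengths, order=3)
```

With windows of length 3 there are at most six distinct patterns, so the resulting distribution is a compact fingerprint that can be compared across texts.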
Submitted 14 March, 2023; v1 submitted 23 August, 2022;
originally announced August 2022.
-
An improved estimator of Shannon entropy with applications to systems with memory
Authors:
Juan De Gregorio,
David Sanchez,
Raul Toral
Abstract:
We investigate the memory properties of discrete sequences built upon a finite number of states. We find that the block entropy can reliably determine the memory for systems modeled as Markov chains of arbitrary finite order. Further, we provide an entropy estimator that gives remarkably accurate results when correlations are present. To illustrate our findings, we calculate the memory of daily precipitation series at different locations. Our results agree with existing methods while remaining valid in the undersampled regime and independent of model selection.
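The link between block entropy and memory rests on a standard fact: for a Markov chain of order m, the conditional entropy h_n = H_{n+1} - H_n stops changing once n reaches m, where H_n is the entropy of blocks of n consecutive symbols. The sketch below computes these quantities with the plain plug-in estimator, which is an assumption for illustration and not the improved estimator the abstract refers to. The function names and the test sequence are hypothetical.

```python
from collections import Counter
from math import log


def block_entropy(sequence, n):
    """Plug-in Shannon entropy (nats) of overlapping blocks of length n."""
    if n == 0:
        return 0.0  # convention: H_0 = 0
    blocks = [tuple(sequence[i:i + n]) for i in range(len(sequence) - n + 1)]
    total = len(blocks)
    return -sum((c / total) * log(c / total) for c in Counter(blocks).values())


def conditional_entropies(sequence, max_n):
    """Return [h_0, ..., h_max_n] with h_n = H_{n+1} - H_n."""
    H = [block_entropy(sequence, n) for n in range(max_n + 2)]
    return [H[n + 1] - H[n] for n in range(max_n + 1)]


# Hypothetical example: a strictly alternating sequence, where each symbol
# is fully determined by the previous one (memory of order 1).
alternating = [i % 2 for i in range(101)]
h = conditional_entropies(alternating, max_n=2)
```

Here `h[0]` is close to log 2, while `h[1]` and `h[2]` are close to zero up to finite-sample effects: knowing one previous symbol removes all uncertainty and longer histories add nothing, which is the plateau that signals the memory order.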
Submitted 18 November, 2022; v1 submitted 24 May, 2022;
originally announced May 2022.