-
Statistical analysis of word flow among five Indo-European languages
Authors:
Josué Ely Molina,
Jorge Flores,
Carlos Gershenson,
Carlos Pineda
Abstract:
A recent increase in data availability has made it possible to perform a variety of statistical linguistic studies. Here we use the Google Books Ngram dataset to analyze word flow among English, French, German, Italian, and Spanish. We study what we define as "migrant words", a type of loanword that does not change its spelling. We quantify migrant words from one language to another for different decades, and notice that most migrant words can be aggregated in semantic fields and associated with historic events. We also study the statistical properties of accumulated migrant words and their rank dynamics. We propose a measure of use of migrant words that could be used as a proxy of cultural influence. Our methodology is not free of caveats, but our results are encouraging enough to promote further studies in this direction.
Submitted 17 January, 2023;
originally announced January 2023.
-
Language statistics at different spatial, temporal, and grammatical scales
Authors:
Fernanda Sánchez-Puig,
Rogelio Lozano-Aranda,
Dante Pérez-Méndez,
Ewan Colman,
Alfredo J. Morales-Guzmán,
Carlos Pineda,
Pedro Juan Rivera Torres,
Carlos Gershenson
Abstract:
Statistical linguistics has advanced considerably in recent decades as data has become available. This has allowed researchers to study how statistical properties of languages change over time. In this work, we use data from Twitter to explore English and Spanish considering the rank diversity at different scales: temporal (from 3 to 96 hour intervals), spatial (from 3km to 3000+km radii), and grammatical (from monograms to pentagrams). We find that all three scales are relevant. However, the greatest changes come from variations in the grammatical scale. At the lowest grammatical scale (monograms), the rank diversity curves are most similar, independently of the values of other scales, languages, and countries. As the grammatical scale grows, the rank diversity curves vary more depending on the temporal and spatial scales, as well as on the language and country. We also study the statistics of Twitter-specific tokens: emojis, hashtags, and user mentions. These particular types of tokens show a sigmoid kind of behaviour as a rank diversity function. Our results help to quantify which aspects of language statistics seem universal and which may lead to variations.
Submitted 26 July, 2022; v1 submitted 1 July, 2022;
originally announced July 2022.
-
Statistical Properties of Rankings in Sports and Games
Authors:
José Antonio Morales,
Jorge Flores,
Carlos Gershenson,
Carlos Pineda
Abstract:
Any collection can be ranked. Sports and games are common examples of ranked systems: players and teams are constantly ranked using different methods. The statistical properties of rankings have been studied for almost a century in a variety of fields. More recently, data availability has allowed us to study rank dynamics: how elements of a ranking change in time. Here, we study the rank distributions and rank dynamics of twelve datasets from different sports and games. To study rank dynamics, we consider measures we have defined previously: rank diversity, change probability, rank entropy, and rank complexity. We also introduce a new measure that we call "system closure" that reflects how many elements enter or leave the rankings in time. We use a random walk model to reproduce the observed rank dynamics, showing that a simple mechanism can generate statistical properties similar to those observed in the datasets. Our results show that, while rank distributions vary considerably for different rankings, rank dynamics have similar behaviors, independently of the nature and competitiveness of the sport or game and its ranking method. Our results also suggest that our measures of rank dynamics are general and applicable to complex systems of different natures.
Submitted 5 November, 2021;
originally announced November 2021.
-
Identifying tax evasion in Mexico with tools from network science and machine learning
Authors:
Martin Zumaya,
Rita Guerrero,
Eduardo Islas,
Omar Pineda,
Carlos Gershenson,
Gerardo Iñiguez,
Carlos Pineda
Abstract:
Mexico has kept electronic records of all taxable transactions since 2014. Anonymized data collected by the Mexican federal government comprises more than 80 million contributors (individuals and companies) and almost 7 billion monthly aggregations of invoices among contributors between January 2015 and December 2018. This data includes a list of almost ten thousand contributors already identified as tax evaders, due to their activities fabricating invoices for non-existing products or services so that recipients can evade taxes. Harnessing this extensive dataset, we build monthly and yearly temporal networks where nodes are contributors and directed links are invoices produced in a given time slice. Exploring the properties of the network neighborhoods around tax evaders, we show that their interaction patterns differ from those of the majority of contributors. In particular, invoicing loops between tax evaders and their clients are over-represented. With this insight, we use two machine-learning methods to classify other contributors as suspects of tax evasion: deep neural networks and random forests. We train each method with a portion of the tax evader list and test it with the rest, obtaining more than 0.9 accuracy with both methods. By using the complete dataset of contributors, each method classifies more than 100 thousand suspects of tax evasion, with more than 40 thousand suspects classified by both methods. We further reduce the number of suspects by focusing on those with a short network distance from known tax evaders. We thus obtain a list of highly suspicious contributors sorted by the amount of evaded tax, valuable information for the authorities to further investigate illegal tax activity in Mexico. With our methods, we estimate previously undetected tax evasion on the order of $10 billion USD per year by about 10 thousand contributors.
Submitted 24 April, 2021;
originally announced April 2021.
-
Rank diversity of languages: Generic behavior in computational linguistics
Authors:
Germinal Cocho,
Jorge Flores,
Carlos Gershenson,
Carlos Pineda,
Sergio Sánchez
Abstract:
Statistical studies of languages have focused on the rank-frequency distribution of words. Instead, we introduce here a measure of how word ranks change in time and call this distribution rank diversity. We calculate this diversity for books published in six European languages since 1800, and find that it follows a universal lognormal distribution. Based on the mean and standard deviation associated with the lognormal distribution, we define three different word regimes of languages: "heads" consist of words which hardly change their rank in time, "bodies" are words of general use, while "tails" are composed of context-specific words and vary their rank considerably in time. The heads and bodies reflect the size of language cores identified by linguists for basic communication. We propose a Gaussian random walk model which reproduces the rank variation of words in time and thus the diversity. Rank diversity of words can be understood as the result of random variations in rank, where the size of the variation depends on the rank itself. We find that the core size is similar for all languages studied.
Submitted 14 May, 2015;
originally announced May 2015.