-
The Danish Gigaword Project
Authors:
Leon Strømberg-Derczynski,
Manuel R. Ciosici,
Rebekah Baglini,
Morten H. Christiansen,
Jacob Aarup Dalsgaard,
Riccardo Fusaroli,
Peter Juel Henrichsen,
Rasmus Hvingelby,
Andreas Kirkedal,
Alex Speed Kjeldsen,
Claus Ladefoged,
Finn Årup Nielsen,
Malte Lau Petersen,
Jonathan Hvithamar Rystrøm,
Daniel Varab
Abstract:
Danish language technology has been hindered by a lack of broad-coverage corpora at the scale modern NLP prefers. This paper describes the Danish Gigaword Corpus, the result of a focused effort to provide a diverse and freely available one-billion-word corpus of Danish text. The Danish Gigaword Corpus covers a wide array of time periods, domains, speakers' socio-economic status, and Danish dialects.
Submitted 12 May, 2021; v1 submitted 7 May, 2020;
originally announced May 2020.
-
Memory limitations are hidden in grammar
Authors:
Carlos Gómez-Rodríguez,
Morten H. Christiansen,
Ramon Ferrer-i-Cancho
Abstract:
The ability to produce and understand an unlimited number of different sentences is a hallmark of human language. Linguists have sought to define the essence of this generative capacity using formal grammars that describe the syntactic dependencies between constituents, independent of the computational limitations of the human brain. Here, we evaluate this independence assumption by sampling sentences uniformly from the space of possible syntactic structures. We find that the average dependency distance between syntactically related words, a proxy for memory limitations, is less than expected by chance in a collection of state-of-the-art classes of dependency grammars. Our findings indicate that memory limitations have permeated grammatical descriptions, suggesting that it may be impossible to build a parsimonious theory of human linguistic productivity independent of non-linguistic cognitive constraints.
Submitted 5 April, 2022; v1 submitted 19 August, 2019;
originally announced August 2019.
-
Networks in Cognitive Science
Authors:
Andrea Baronchelli,
Ramon Ferrer-i-Cancho,
Romualdo Pastor-Satorras,
Nick Chater,
Morten H. Christiansen
Abstract:
Networks of interconnected nodes have long played a key role in Cognitive Science, from artificial neural networks to spreading activation models of semantic memory. Recently, however, a new Network Science has been developed, providing insights into the emergence of global, system-scale properties in contexts as diverse as the Internet, metabolic reactions, and collaborations among scientists. Today, the inclusion of network theory into Cognitive Sciences, and the expansion of complex-systems science, promises to significantly change the way in which the organization and dynamics of cognitive and behavioral processes are understood. In this paper, we review recent contributions of network theory at different levels and domains within the Cognitive Sciences.
Submitted 5 July, 2013; v1 submitted 24 April, 2013;
originally announced April 2013.
-
The Biological Origin of Linguistic Diversity
Authors:
Andrea Baronchelli,
Nick Chater,
Romualdo Pastor-Satorras,
Morten H. Christiansen
Abstract:
In contrast with animal communication systems, diversity is characteristic of almost every aspect of human language. Languages variously employ tones, clicks, or manual signs to signal differences in meaning; some languages lack the noun-verb distinction (e.g., Straits Salish), whereas others have a proliferation of fine-grained syntactic categories (e.g., Tzeltal); and some languages do without morphology (e.g., Mandarin), while others pack a whole sentence into a single word (e.g., Cayuga). A challenge for evolutionary biology is to reconcile the diversity of languages with the high degree of biological uniformity of their speakers. Here, we model processes of language change and geographical dispersion and find a consistent pressure for flexible learning, irrespective of the language being spoken. This pressure arises because flexible learners can best cope with the observed high rates of linguistic change associated with divergent cultural evolution following human migration. Thus, rather than genetic adaptations for specific aspects of language, such as recursion, the coevolution of genes and fast-changing linguistic structure provides the biological basis for linguistic diversity. Only biological adaptations for flexible learning combined with cultural evolution can explain how each child has the potential to learn any human language.
Submitted 12 February, 2013;
originally announced February 2013.