-
DANSK and DaCy 2.6.0: Domain Generalization of Danish Named Entity Recognition
Authors:
Kenneth Enevoldsen,
Emil Trenckner Jessen,
Rebekah Baglini
Abstract:
Named entity recognition is one of the cornerstones of Danish NLP, essential for language technology applications within both industry and research. However, Danish NER is inhibited by a lack of available datasets. As a consequence, no current models are capable of fine-grained named entity recognition, nor have they been evaluated for potential generalizability issues across datasets and domains. To alleviate these limitations, this paper introduces: 1) DANSK: a named entity dataset providing for high-granularity tagging as well as within-domain evaluation of models across a diverse set of domains; 2) DaCy 2.6.0 that includes three generalizable models with fine-grained annotation; and 3) an evaluation of current state-of-the-art models' ability to generalize across domains. The evaluation of existing and new models revealed notable performance discrepancies across domains, which should be addressed within the field. Shortcomings of the annotation quality of the dataset and its impact on model training and evaluation are also discussed. Despite these limitations, we advocate for the use of the new dataset DANSK alongside further work on the generalizability within Danish NER.
Submitted 28 February, 2024;
originally announced February 2024.
-
Natural Language Processing 4 All (NLP4All): A New Online Platform for Teaching and Learning NLP Concepts
Authors:
Rebekah Baglini,
Arthur Hjorth
Abstract:
Natural Language Processing offers new insights into language data across almost all disciplines and domains, and allows us to corroborate and/or challenge existing knowledge. The primary hurdles to widening participation in and use of these new research tools are, first, a lack of coding skills in students across K-16 and in the population at large, and second, a lack of knowledge of how NLP methods can be used to answer questions of disciplinary interest outside of linguistics and/or computer science. To broaden participation in NLP and improve NLP literacy, we introduce a new web-based tool called Natural Language Processing 4 All (NLP4All). The intended purpose of NLP4All is to help teachers facilitate learning with and about NLP by providing easy-to-use interfaces to NLP methods, data, and analyses, making it possible for non-programmers and novice programmers to learn NLP concepts interactively.
Submitted 28 May, 2021;
originally announced May 2021.
-
When no news is bad news -- Detection of negative events from news media content
Authors:
Kristoffer L. Nielbo,
Frida Haestrup,
Kenneth C. Enevoldsen,
Peter B. Vahlstrup,
Rebekah B. Baglini,
Andreas Roepstorff
Abstract:
During the first wave of Covid-19, information decoupling could be observed in the flow of news media content. The corollary of the content alignment within and between news sources experienced by readers (i.e., all news transformed into Corona-news) was that the novelty of news content went down as media focused monotonically on the pandemic event. This all-important Covid-19 news theme turned out to be quite persistent as the pandemic continued, resulting in a situation that is paradoxical from a news media perspective, where the same news was repeated over and over. This information phenomenon, where novelty decreases and persistence increases, has previously been used to track change in news media, but in this study we specifically test the claim that new information decoupling behavior of media can be used to reliably detect change in news media content originating in a negative event, using a Bayesian approach to change point detection.
Submitted 12 February, 2021;
originally announced February 2021.
-
News Information Decoupling: An Information Signature of Catastrophes in Legacy News Media
Authors:
Kristoffer L. Nielbo,
Rebekah B. Baglini,
Peter B. Vahlstrup,
Kenneth C. Enevoldsen,
Anja Bechmann,
Andreas Roepstorff
Abstract:
Content alignment in news media was an observable information effect of Covid-19's initial phase. During the first half of 2020, legacy news media became "corona news" following national outbreak and crisis management patterns. While news media are neither unbiased nor infallible as sources of events, they do provide a window into socio-cultural responses to events. In this paper, we use legacy print media to empirically derive the principle of News Information Decoupling (NID), which functions as an information signature of culturally significant catastrophic events. Formally, NID can provide input to change detection algorithms and points to several unsolved research problems at the intersection of information theory and media studies.
Submitted 8 January, 2021;
originally announced January 2021.
-
The Danish Gigaword Project
Authors:
Leon Strømberg-Derczynski,
Manuel R. Ciosici,
Rebekah Baglini,
Morten H. Christiansen,
Jacob Aarup Dalsgaard,
Riccardo Fusaroli,
Peter Juel Henrichsen,
Rasmus Hvingelby,
Andreas Kirkedal,
Alex Speed Kjeldsen,
Claus Ladefoged,
Finn Årup Nielsen,
Malte Lau Petersen,
Jonathan Hvithamar Rystrøm,
Daniel Varab
Abstract:
Danish language technology has been hindered by a lack of broad-coverage corpora at the scale modern NLP prefers. This paper describes the Danish Gigaword Corpus, the result of a focused effort to provide a diverse and freely available one-billion-word corpus of Danish text. The Danish Gigaword Corpus covers a wide array of time periods, domains, speakers' socio-economic status, and Danish dialects.
Submitted 12 May, 2021; v1 submitted 7 May, 2020;
originally announced May 2020.