The Danish Gigaword Project
Authors:
Leon Strømberg-Derczynski,
Manuel R. Ciosici,
Rebekah Baglini,
Morten H. Christiansen,
Jacob Aarup Dalsgaard,
Riccardo Fusaroli,
Peter Juel Henrichsen,
Rasmus Hvingelby,
Andreas Kirkedal,
Alex Speed Kjeldsen,
Claus Ladefoged,
Finn Årup Nielsen,
Malte Lau Petersen,
Jonathan Hvithamar Rystrøm,
Daniel Varab
Abstract:
Danish language technology has been hindered by a lack of broad-coverage corpora at the scale modern NLP prefers. This paper describes the Danish Gigaword Corpus, the result of a focused effort to provide a diverse and freely available one-billion-word corpus of Danish text. The Danish Gigaword Corpus covers a wide array of time periods, domains, speakers' socio-economic status, and Danish dialects.
Submitted 12 May, 2021; v1 submitted 7 May, 2020;
originally announced May 2020.