FT Speech: Danish Parliament Speech Corpus
Authors:
Andreas Kirkedal,
Marija Stepanović,
Barbara Plank
Abstract:
This paper introduces FT Speech, a new speech corpus created from the recorded meetings of the Danish Parliament, otherwise known as the Folketing (FT). The corpus contains over 1,800 hours of transcribed speech by a total of 434 speakers. It is significantly larger in duration, vocabulary, and amount of spontaneous speech than the existing public speech corpora for Danish, which are largely limited to read-aloud and dictation data. We outline design considerations, including the preprocessing methods and the alignment procedure. To evaluate the quality of the corpus, we train automatic speech recognition (ASR) systems on the new resource and compare them to systems trained on the Danish part of Språkbanken, the largest public ASR corpus for Danish to date. Our baseline results show a word error rate (WER) of 14.01 on the new corpus. Combining FT Speech with in-domain language data gives results comparable to models trained specifically on Språkbanken, showing that FT Speech transfers well to this data set. Interestingly, our results demonstrate that the opposite is not the case. This shows that FT Speech is a valuable resource for promoting research on Danish ASR with more spontaneous speech.
Submitted 28 October, 2020; v1 submitted 25 May, 2020;
originally announced May 2020.
The Danish Gigaword Project
Authors:
Leon Strømberg-Derczynski,
Manuel R. Ciosici,
Rebekah Baglini,
Morten H. Christiansen,
Jacob Aarup Dalsgaard,
Riccardo Fusaroli,
Peter Juel Henrichsen,
Rasmus Hvingelby,
Andreas Kirkedal,
Alex Speed Kjeldsen,
Claus Ladefoged,
Finn Årup Nielsen,
Malte Lau Petersen,
Jonathan Hvithamar Rystrøm,
Daniel Varab
Abstract:
Danish language technology has been hindered by a lack of broad-coverage corpora at the scale modern NLP prefers. This paper describes the Danish Gigaword Corpus, the result of a focused effort to provide a diverse and freely available one-billion-word corpus of Danish text. The Danish Gigaword Corpus covers a wide array of time periods, domains, speakers' socio-economic status, and Danish dialects.
Submitted 12 May, 2021; v1 submitted 7 May, 2020;
originally announced May 2020.