Showing 1–2 of 2 results for author: Kåsen, A

Search v0.5.6 released 2020-02-24

arXiv:2305.13527 [pdf, other]

cs.CL

Aligning the Norwegian UD Treebank with Entity and Coreference Information

Authors: Tollef Emil Jørgensen, Andre Kåsen

Abstract: This paper presents a merged collection of entity and coreference annotated data grounded in the Universal Dependencies (UD) treebanks for the two written forms of Norwegian: Bokmål and Nynorsk. The aligned and converted corpora are the Norwegian Named Entities (NorNE) and Norwegian Anaphora Resolution Corpus (NARC). While NorNE is aligned with an older version of the treebank, NARC is misaligned and requires extensive transformation from the original annotations to the UD structure and CoNLL-U format. We here demonstrate the conversion and alignment processes, along with an analysis of discovered issues and errors in the data - some of which include data split overlaps in the original treebank. These procedures and the developed system may prove helpful for future corpus alignment and coreference annotation endeavors. The merged corpora comprise the first Norwegian UD treebank enriched with named entities and coreference information.

Submitted 25 May, 2023; v1 submitted 22 May, 2023; originally announced May 2023.

Comments: 4 pages, 1 table. Appendix: 3 tables and 5 data examples

ACM Class: I.2.7
arXiv:2210.06150 [pdf, other]

cs.CL

Annotating Norwegian Language Varieties on Twitter for Part-of-Speech

Authors: Petter Mæhlum, Andre Kåsen, Samia Touileb, Jeremy Barnes

Abstract: Norwegian Twitter data poses an interesting challenge for Natural Language Processing (NLP) tasks. These texts are difficult for models trained on standardized text in one of the two Norwegian written forms (Bokmål and Nynorsk), as they contain both the typical variation of social media text, as well as a large amount of dialectal variety. In this paper we present a novel Norwegian Twitter dataset annotated with POS-tags. We show that models trained on Universal Dependency (UD) data perform worse when evaluated against this dataset, and that models trained on Bokmål generally perform better than those trained on Nynorsk. We also see that performance on dialectal tweets is comparable to the written standards for some models. Finally we perform a detailed analysis of the errors that models commonly make on this data.

Submitted 12 October, 2022; originally announced October 2022.

Comments: Accepted at the Ninth Workshop on NLP for Similar Languages, Varieties and Dialects (Vardial2022). Collocated with COLING2022
