Showing 1–2 of 2 results for author: Marcińczuk, M

Search v0.5.6 released 2020-02-24

arXiv:2406.17474 [pdf, other]

cs.CL cs.AI

Transformer-based Named Entity Recognition with Combined Data Representation

Authors: Michał Marcińczuk

Abstract: This study examines transformer-based models and their effectiveness in named entity recognition tasks. The study investigates data representation strategies, including single, merged, and context, which respectively use one sentence, multiple sentences, and sentences joined with attention to context per vector. Analysis shows that training models with a single strategy may lead to poor performance on different data representations. To address this limitation, the study proposes a combined training procedure that utilizes all three strategies to improve model stability and adaptability. The results of this approach are presented and discussed for four languages (English, Polish, Czech, and German) across various datasets, demonstrating the effectiveness of the combined strategy.

Submitted 25 June, 2024; originally announced June 2024.

Comments: 14 pages, 6 figures

arXiv:2404.00482 [pdf, other]

cs.CL cs.AI cs.LG

Cross-lingual Named Entity Corpus for Slavic Languages

Authors: Jakub Piskorski, Michał Marcińczuk, Roman Yangarber

Abstract: This paper presents a corpus manually annotated with named entities for six Slavic languages - Bulgarian, Czech, Polish, Slovenian, Russian, and Ukrainian. This work is the result of a series of shared tasks, conducted in 2017-2023 as a part of the Workshops on Slavic Natural Language Processing. The corpus consists of 5 017 documents on seven topics. The documents are annotated with five classes of named entities. Each entity is described by a category, a lemma, and a unique cross-lingual identifier. We provide two train-tune dataset splits - single topic out and cross topics. For each split, we set benchmarks using a transformer-based neural network architecture with the pre-trained multilingual models - XLM-RoBERTa-large for named entity mention recognition and categorization, and mT5-large for named entity lemmatization and linking.

Submitted 7 April, 2024; v1 submitted 30 March, 2024; originally announced April 2024.

Comments: Published in LREC-COLING 2024 - The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation
