Showing 1–2 of 2 results for author: Nemeskey, D M

Search v0.5.6 released 2020-02-24

arXiv:2102.10848 [pdf, other]

cs.CL

Evaluating Contextualized Language Models for Hungarian

Authors: Judit Ács, Dániel Lévai, Dávid Márk Nemeskey, András Kornai

Abstract: We present an extended comparison of contextualized language models for Hungarian. We compare huBERT, a Hungarian model, against 4 multilingual models including the multilingual BERT model. We evaluate these models through three tasks: morphological probing, POS tagging and NER. We find that huBERT works better than the other models, often by a large margin, particularly near the global optimum (typically at the middle layers). We also find that huBERT tends to generate fewer subwords for one word and that using the last subword for token-level tasks is generally a better choice than using the first one.

Submitted 22 February, 2021; originally announced February 2021.

Journal ref: Hungarian NLP Conference (MSZNY2021)

arXiv:1701.07880 [pdf]

cs.CL

emLam -- a Hungarian Language Modeling baseline

Authors: Dávid Márk Nemeskey

Abstract: This paper aims to make up for the lack of documented baselines for Hungarian language modeling. Various approaches are evaluated on three publicly available Hungarian corpora. Perplexity values comparable to models of similar-sized English corpora are reported. A new, freely downloadable Hungarian benchmark corpus is introduced.

Submitted 26 January, 2017; originally announced January 2017.

Comments: Additional resources: the emLam repository (https://github.com/DavidNemeskey/emLam) and the emLam corpus (http://hlt.bme.hu/en/resources/emLam)

ACM Class: I.2.7

Journal ref: In Proceedings of the 13th Conference on Hungarian Computational Linguistics (MSZNY), pp. 91-102. Szeged, 2017
