
doi 10.1017/S1351324923000190

Morphosyntactic probing of multilingual BERT models

Authors: Judit Acs, Endre Hamerlik, Roy Schwartz, Noah A. Smith, Andras Kornai

Abstract: We introduce an extensive dataset for multilingual probing of morphological information in language models (247 tasks across 42 languages from 10 families), each consisting of a sentence with a target word and a morphological tag as the desired label, derived from the Universal Dependencies treebanks. We find that pre-trained Transformer models (mBERT and XLM-RoBERTa) learn features that attain strong performance across these tasks. We then apply two methods to locate, for each probing task, where the disambiguating information resides in the input. The first is a new perturbation method that masks various parts of context; the second is the classical method of Shapley values. The most intriguing finding that emerges is a strong tendency for the preceding context to hold more information relevant to the prediction than the following context.

Submitted 9 June, 2023; originally announced June 2023.

Comments: to appear in the Journal of Natural Language Engineering

arXiv:2303.00752

Safety without alignment

Authors: András Kornai, Michael Bukatin, Zsolt Zombori

Abstract: Currently, the dominant paradigm in AI safety is alignment with human values. Here we describe progress on developing an alternative approach to safety, based on ethical rationalism (Gewirth:1978), and propose an inherently safe implementation path via hybrid theorem provers in a sandbox. As AGIs evolve, their alignment may fade, but their rationality can only increase (otherwise more rational ones will have a significant evolutionary advantage) so an approach that ties their ethics to their rationality has clear long-term advantages.

Submitted 18 March, 2023; v1 submitted 27 February, 2023; originally announced March 2023.

arXiv:2109.06327

Evaluating Transferability of BERT Models on Uralic Languages

Authors: Judit Ács, Dániel Lévai, András Kornai

Abstract: Transformer-based language models such as BERT have outperformed previous models on a large number of English benchmarks, but their evaluation is often limited to English or a small number of well-resourced languages. In this work, we evaluate monolingual, multilingual, and randomly initialized language models from the BERT family on a variety of Uralic languages including Estonian, Finnish, Hungarian, Erzya, Moksha, Karelian, Livvi, Komi Permyak, Komi Zyrian, Northern Sámi, and Skolt Sámi. When monolingual models are available (currently only et, fi, hu), these perform better on their native language, but in general they transfer worse than multilingual models or models of genetically unrelated languages that share the same character set. Remarkably, straightforward transfer of high-resource models, even without special efforts toward hyperparameter optimization, yields what appear to be state of the art POS and NER tools for the minority Uralic languages where there is sufficient data for finetuning.

Submitted 23 November, 2021; v1 submitted 13 September, 2021; originally announced September 2021.

Comments: Seventh International Workshop for Computational Linguistics of Uralic Languages (IWCLUL 2021)

arXiv:2102.10864

Subword Pooling Makes a Difference

Authors: Judit Ács, Ákos Kádár, András Kornai

Abstract: Contextual word-representations became a standard in modern natural language processing systems. These models use subword tokenization to handle large vocabularies and unknown words. Word-level usage of such systems requires a way of pooling multiple subwords that correspond to a single word. In this paper we investigate how the choice of subword pooling affects the downstream performance on three tasks: morphological probing, POS tagging and NER, in 9 typologically diverse languages. We compare these in two massively multilingual models, mBERT and XLM-RoBERTa. For morphological tasks, the widely used "choose the first subword" is the worst strategy and the best results are obtained by using attention over the subwords. For POS tagging both of these strategies perform poorly and the best choice is to use a small LSTM over the subwords. The same strategy works best for NER and we show that mBERT is better than XLM-RoBERTa in all 9 languages. We publicly release all code, data and the full result tables at https://github.com/juditacs/subword-choice.

Submitted 29 March, 2021; v1 submitted 22 February, 2021; originally announced February 2021.

Journal ref: EACL2021

arXiv:2102.10848

Evaluating Contextualized Language Models for Hungarian

Authors: Judit Ács, Dániel Lévai, Dávid Márk Nemeskey, András Kornai

Abstract: We present an extended comparison of contextualized language models for Hungarian. We compare huBERT, a Hungarian model, against 4 multilingual models including the multilingual BERT model. We evaluate these models through three tasks: morphological probing, POS tagging and NER. We find that huBERT works better than the other models, often by a large margin, particularly near the global optimum (typically at the middle layers). We also find that huBERT tends to generate fewer subwords for one word and that using the last subword for token-level tasks is generally a better choice than using the first one.

Submitted 22 February, 2021; originally announced February 2021.

Journal ref: Hungarian NLP Conference (MSZNY2021)

arXiv:2012.04575

The Role of Interpretable Patterns in Deep Learning for Morphology

Authors: Judit Acs, Andras Kornai

Abstract: We examine the role of character patterns in three tasks: morphological analysis, lemmatization and copy. We use a modified version of the standard sequence-to-sequence model, where the encoder is a pattern matching network. Each pattern scores all possible N character long subwords (substrings) on the source side, and the highest scoring subword's score is used to initialize the decoder as well as the input to the attention mechanism. This method allows learning which subwords of the input are important for generating the output. By training the models on the same source but different target, we can compare what subwords are important for different tasks and how they relate to each other. We define a similarity metric, a generalized form of the Jaccard similarity, and assign a similarity score to each pair of the three tasks that work on the same source but may differ in target. We examine how these three tasks are related to each other in 12 languages. Our code is publicly available.

Submitted 8 December, 2020; originally announced December 2020.

Comments: Best paper at the Hungarian NLP conference (MSZNY2020)

Journal ref: XVI. Magyar Számítógépes Nyelvészeti Konferencia, 2020, page 171-179 (MSZNY2020)

arXiv:1905.10924

Naive probability

Authors: Zalan Gyenis, Andras Kornai

Abstract: We describe a rational, but low resolution model of probability.

Submitted 13 December, 2021; v1 submitted 20 May, 2019; originally announced May 2019.

Comments: 8 pages

ACM Class: I.2.3; I.2.4

arXiv:1905.09139

Sentence Length

Authors: Gábor Borbély, András Kornai

Abstract: The distribution of sentence length in ordinary language is not well captured by the existing models. Here we survey previous models of sentence length and present our random walk model that offers both a better fit with the data and a better understanding of the distribution. We develop a generalization of KL divergence, discuss measuring the noise inherent in a corpus, and present a hyperparameter-free Bayesian model comparison method that has strong conceptual ties to Minimal Description Length modeling. The models we obtain require only a few dozen bits, orders of magnitude less than the naive nonparametric MDL models would.

Submitted 22 May, 2019; originally announced May 2019.

arXiv:1204.2765

doi 10.1371/journal.pone.0048386

A practical approach to language complexity: a Wikipedia case study

Authors: Taha Yasseri, András Kornai, János Kertész

Abstract: In this paper we present statistical analysis of English texts from Wikipedia. We try to address the issue of language complexity empirically by comparing the simple English Wikipedia (Simple) to comparable samples of the main English Wikipedia (Main). Simple is supposed to use a more simplified language with a limited vocabulary, and editors are explicitly requested to follow this guideline, yet in practice the vocabulary richness of both samples is at the same level. Detailed analysis of longer units (n-grams of words and part of speech tags) shows that the language of Simple is less complex than that of Main primarily due to the use of shorter sentences, as opposed to drastically simplified syntax or vocabulary. Comparing the two language varieties by the Gunning readability index supports this conclusion. We also report on the topical dependence of language complexity, e.g. that the language is more advanced in conceptual articles compared to person-based (biographical) and object-based articles. Finally, we investigate the relation between conflict and language complexity by analyzing the content of the talk pages associated with controversial and peacefully developing articles, concluding that controversy has the effect of reducing language complexity.

Submitted 18 August, 2012; v1 submitted 12 April, 2012; originally announced April 2012.

Comments: 2 new figures, 1 new section, and 2 new supporting texts

Journal ref: PLoS ONE 7(11): e48386 (2012)

arXiv:1202.3643

doi 10.1371/journal.pone.0038869

Dynamics of conflicts in Wikipedia

Authors: Taha Yasseri, Robert Sumi, András Rung, András Kornai, János Kertész

Abstract: In this work we study the dynamical features of editorial wars in Wikipedia (WP). Based on our previously established algorithm, we build up samples of controversial and peaceful articles and analyze the temporal characteristics of the activity in these samples. On short time scales, we show that there is a clear correspondence between conflict and burstiness of activity patterns, and that memory effects play an important role in controversies. On long time scales, we identify three distinct developmental patterns for the overall behavior of the articles. We are able to distinguish cases eventually leading to consensus from those cases where a compromise is far from achievable. Finally, we analyze discussion networks and conclude that edit wars are mainly fought by few editors only.

Submitted 2 May, 2012; v1 submitted 16 February, 2012; originally announced February 2012.

Comments: Supporting information added

Journal ref: PLoS ONE 7(6): e38869 (2012)

arXiv:1107.3689

doi 10.1109/PASSAT/SocialCom.2011.47

Edit wars in Wikipedia

Authors: Róbert Sumi, Taha Yasseri, András Rung, András Kornai, János Kertész

Abstract: We present a new, efficient method for automatically detecting severe conflicts ("edit wars") in Wikipedia and evaluate this method on six different language WPs. We discuss how the number of edits, reverts, the length of discussions, the burstiness of edits and reverts deviate in such pages from those following the general workflow, and argue that earlier work has significantly over-estimated the contentiousness of the Wikipedia editing process.

Submitted 9 February, 2012; v1 submitted 19 July, 2011; originally announced July 2011.

Comments: 4 pages, 2 figures, 3 tables. The current version is shortened to be published in SocialCom 2011

Journal ref: IEEE Third International Conference on Social Computing (SocialCom), 9-11 Oct. 2011, 724-727, Boston, MA, USA

Showing 1–11 of 11 results for author: Kornai, A