On the Language Neutrality of Pre-trained Multilingual Representations

Libovický, Jindřich; Rosa, Rudolf; Fraser, Alexander

arXiv:2004.05160 (cs)

[Submitted on 9 Apr 2020 (v1), last revised 29 Sep 2020 (this version, v4)]

Abstract: Multilingual contextual embeddings, such as multilingual BERT and XLM-RoBERTa, have proved useful for many multilingual tasks. Previous work probed the cross-linguality of the representations indirectly using zero-shot transfer learning on morphological and syntactic tasks. We instead investigate the language-neutrality of multilingual contextual embeddings directly and with respect to lexical semantics. Our results show that contextual embeddings are more language-neutral and, in general, more informative than aligned static word-type embeddings, which are explicitly trained for language neutrality. Contextual embeddings are still only moderately language-neutral by default, so we propose two simple methods for achieving stronger language neutrality: first, by unsupervised centering of the representation for each language and second, by fitting an explicit projection on small parallel data. In addition, we show how to reach state-of-the-art accuracy on language identification and match the performance of statistical methods for word alignment of parallel sentences without using parallel data.
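
The two methods named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `center_per_language` and `fit_projection` are hypothetical, the embeddings are assumed to be already extracted as a NumPy array with one row per sentence or token, and ordinary least squares is an assumed choice for the projection, since the abstract does not say how the projection is fitted.

```python
import numpy as np


def center_per_language(embeddings, languages):
    """Subtract each language's mean vector from that language's embeddings.

    embeddings: array of shape (n_items, dim)
    languages:  sequence of n_items language codes, aligned with the rows
    """
    embeddings = np.asarray(embeddings, dtype=float)
    languages = np.asarray(languages)
    centered = np.empty_like(embeddings)
    for lang in np.unique(languages):
        mask = languages == lang
        # The per-language mean needs no labels or parallel data.
        centered[mask] = embeddings[mask] - embeddings[mask].mean(axis=0)
    return centered


def fit_projection(source, target):
    """Fit a linear map W so that source @ W approximates target.

    Assumption: plain least squares on row-aligned (parallel) embeddings.
    """
    source = np.asarray(source, dtype=float)
    target = np.asarray(target, dtype=float)
    W, *_ = np.linalg.lstsq(source, target, rcond=None)
    return W


if __name__ == "__main__":
    # Toy data: two "English" and two "German" vectors with different offsets.
    vectors = np.array([[1.0, 2.0], [3.0, 4.0], [10.0, 10.0], [12.0, 14.0]])
    langs = ["en", "en", "de", "de"]
    print(center_per_language(vectors, langs))
```

After centering, every language's vectors have zero mean, which removes a constant language-specific offset. The projection step differs in that it needs embeddings of parallel items in the same row order.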

Comments:	12 pages, 3 figures. arXiv admin note: text overlap with arXiv:1911.03310. Accepted to Findings of EMNLP 2020
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2004.05160 [cs.CL]
	(or arXiv:2004.05160v4 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2004.05160

Submission history

From: Jindřich Libovický
[v1] Thu, 9 Apr 2020 19:50:32 UTC (223 KB)
[v2] Mon, 20 Apr 2020 11:44:10 UTC (223 KB)
[v3] Thu, 23 Apr 2020 16:10:07 UTC (224 KB)
[v4] Tue, 29 Sep 2020 18:48:19 UTC (287 KB)
