Tokenization Impacts Multilingual Language Modeling: Assessing Vocabulary Allocation and Overlap Across Languages

Limisiewicz, Tomasz; Balhar, Jiří; Mareček, David

arXiv:2305.17179 (cs)

[Submitted on 26 May 2023]

Abstract: Multilingual language models have recently gained attention as a promising solution for representing multiple languages in a single model. In this paper, we propose new criteria to evaluate the quality of lexical representation and vocabulary overlap observed in sub-word tokenizers. Our findings show that vocabulary overlap across languages can actually be detrimental to certain downstream tasks (POS tagging, dependency tree labeling). In contrast, NER and sentence-level tasks (cross-lingual retrieval, NLI) benefit from sharing vocabulary. We also observe that the coverage of language-specific tokens in the multilingual vocabulary significantly impacts word-level tasks. Our study offers a deeper understanding of the role of tokenizers in multilingual language models, along with guidelines that help future model developers choose the most suitable tokenizer for their specific application before undertaking costly model pre-training.
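
The abstract does not spell out the proposed criteria, so the following is only a minimal sketch of how vocabulary overlap and per-language token granularity might be quantified for an already-tokenized corpus. The choice of Jensen-Shannon divergence between unigram token distributions, the characters-per-token ratio, the function names, and the toy segmentations are all assumptions made for illustration; they are not taken from the text above.

```python
import math
from collections import Counter


def token_distribution(tokenized_sentences):
    """Empirical unigram distribution over tokens, returned as a dict."""
    counts = Counter(tok for sent in tokenized_sentences for tok in sent)
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}


def jensen_shannon_divergence(p, q):
    """JSD in bits between two distributions given as dicts.

    0.0 means identical distributions (full overlap),
    1.0 means no shared tokens at all.
    """
    jsd = 0.0
    for tok in set(p) | set(q):
        pi, qi = p.get(tok, 0.0), q.get(tok, 0.0)
        mi = 0.5 * (pi + qi)  # mixture probability of this token
        if pi > 0:
            jsd += 0.5 * pi * math.log2(pi / mi)
        if qi > 0:
            jsd += 0.5 * qi * math.log2(qi / mi)
    return jsd


def chars_per_token(tokenized_sentences):
    """Average token length in characters (higher = coarser segmentation)."""
    tokens = [tok for sent in tokenized_sentences for tok in sent]
    return sum(len(t) for t in tokens) / len(tokens)


# Hypothetical sub-word segmentations of two tiny corpora.
english = [["the", "token", "izer"], ["the", "model"]]
german = [["der", "token", "izer"], ["das", "model", "l"]]

overlap = jensen_shannon_divergence(
    token_distribution(english), token_distribution(german)
)
print(f"JSD(en, de) = {overlap:.3f}")
print(f"chars/token en = {chars_per_token(english):.2f}")
print(f"chars/token de = {chars_per_token(german):.2f}")
```

Under these assumptions, a lower divergence indicates that the two languages draw on more of the same tokens, while a lower characters-per-token ratio suggests that a language is split into finer pieces by the shared vocabulary.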

Comments:	in ACL Findings 2023
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2305.17179 [cs.CL]
	(or arXiv:2305.17179v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2305.17179

Submission history

From: Tomasz Limisiewicz
[v1] Fri, 26 May 2023 18:06:49 UTC (3,503 KB)
