Oolong: Investigating What Makes Transfer Learning Hard with Controlled Studies

Wu, Zhengxuan; Tamkin, Alex; Papadimitriou, Isabel

arXiv:2202.12312 (cs)

[Submitted on 24 Feb 2022 (v1), last revised 23 Jan 2024 (this version, v2)]

Abstract: When we transfer a pretrained language model to a new language, many axes of variation change at once. To disentangle the impact of factors such as syntactic similarity and vocabulary similarity, we propose a set of controlled transfer studies: we systematically transform the language of the GLUE benchmark, altering one axis of cross-lingual variation at a time, and then measure the resulting drops in a pretrained model's downstream performance. We find that models can largely recover from syntactic-style shifts, but cannot recover from vocabulary misalignment and embedding matrix re-initialization, even with continued pretraining on 15 million tokens. Moreover, good-quality tokenizers in the transfer language do not make vocabulary alignment easier. Our experiments provide insights into the factors of cross-lingual transfer that researchers should most focus on when designing language transfer scenarios.
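
To make the idea of altering one axis of variation at a time more concrete, the following is a minimal sketch in Python. It is not the authors' implementation. All function names are hypothetical, reversing word order is only one assumed example of a syntactic-style shift, and a random permutation of token ids is only one assumed way to simulate vocabulary misalignment. The scores passed to `performance_drop` would come from fine-tuning and evaluating a model, which is not shown here.

```python
import random


def make_vocab_permutation(vocab_size, seed=0):
    """Return a random bijection old_id -> new_id over the vocabulary.

    Assumption: vocabulary misalignment is simulated by relabeling ids,
    so the pretrained embedding rows no longer match their tokens.
    """
    rng = random.Random(seed)
    perm = list(range(vocab_size))
    rng.shuffle(perm)
    return perm


def misalign_vocabulary(token_ids, perm):
    """Relabel every token id through the permutation (vocabulary axis only)."""
    return [perm[t] for t in token_ids]


def reverse_word_order(token_ids):
    """Reverse the sequence (an assumed syntactic-style shift; ids unchanged)."""
    return token_ids[::-1]


def performance_drop(baseline_score, transformed_score):
    """Drop in downstream score attributable to one transformation."""
    return baseline_score - transformed_score


if __name__ == "__main__":
    # Toy token ids standing in for one tokenized GLUE sentence.
    sentence = [7, 2, 9, 2, 4]
    perm = make_vocab_permutation(vocab_size=10, seed=0)

    # Each transformation changes exactly one axis of the input.
    print("original:      ", sentence)
    print("reordered:     ", reverse_word_order(sentence))
    print("vocab shifted: ", misalign_vocabulary(sentence, perm))
```

Keeping each transformation in its own function means a measured drop can be attributed to a single axis: the reordering leaves token identities intact, while the permutation leaves word order intact.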

Comments:	EMNLP 2023
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2202.12312 [cs.CL]
	(or arXiv:2202.12312v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2202.12312

Submission history

From: Zhengxuan Wu
[v1] Thu, 24 Feb 2022 19:00:39 UTC (1,970 KB)
[v2] Tue, 23 Jan 2024 22:09:07 UTC (3,851 KB)
