Mitigating Data Imbalance and Representation Degeneration in Multilingual Machine Translation

Lai, Wen; Chronopoulou, Alexandra; Fraser, Alexander


arXiv:2305.12786 (cs)

[Submitted on 22 May 2023 (v1), last revised 24 Oct 2023 (this version, v2)]

Title: Mitigating Data Imbalance and Representation Degeneration in Multilingual Machine Translation

Authors: Wen Lai, Alexandra Chronopoulou, Alexander Fraser


Abstract: Despite advances in multilingual neural machine translation (MNMT), we argue that there are still two major challenges in this area: data imbalance and representation degeneration. The data imbalance problem refers to the imbalance in the amount of parallel corpora for all language pairs, especially for long-tail languages (i.e., very low-resource languages). The representation degeneration problem refers to the problem of encoded tokens tending to appear only in a small subspace of the full space available to the MNMT model. To solve these two issues, we propose Bi-ACL, a framework that uses only target-side monolingual data and a bilingual dictionary to improve the performance of the MNMT model. We define two modules, named bidirectional autoencoder and bidirectional contrastive learning, which we combine with an online constrained beam search and a curriculum learning sampling strategy. Extensive experiments show that our proposed method is more effective both in long-tail languages and in high-resource languages. We also demonstrate that our approach is capable of transferring knowledge between domains and languages in zero-shot scenarios.
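The representation degeneration problem described in the abstract (encoded tokens crowding into a small subspace) can be illustrated with a simple diagnostic. The following is a minimal sketch under stated assumptions, not the paper's method: it uses mean pairwise cosine similarity as a proxy for how narrowly a set of vectors is distributed, which is a choice made here for illustration since the abstract does not say how degeneration is measured. The vectors are synthetic, and the names `mean_pairwise_cosine`, `spread`, and `narrow` are hypothetical.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_pairwise_cosine(vectors):
    """Average cosine similarity over all distinct pairs of vectors."""
    total, count = 0.0, 0
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            total += cosine(vectors[i], vectors[j])
            count += 1
    return total / count


rng = random.Random(0)  # fixed seed so the run is repeatable
dim, n = 16, 50

# Synthetic stand-in for well-spread representations:
# directions are scattered over the whole space.
spread = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]

# Synthetic stand-in for degenerate representations: the same vectors
# plus a large shared offset, so they all point in a similar direction.
offset = [5.0] * dim
narrow = [[x + o for x, o in zip(vec, offset)] for vec in spread]

spread_score = mean_pairwise_cosine(spread)
narrow_score = mean_pairwise_cosine(narrow)
print(f"spread: {spread_score:.3f}  narrow: {narrow_score:.3f}")
```

Under this proxy, a score near 0 suggests directions that use the full space, while a score near 1 suggests vectors confined to a narrow cone. Applying it to a real MNMT model would require extracting encoder states, which this sketch does not attempt.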

Comments:	Accepted to Findings of EMNLP 2023; added statistical significance tests. Code available at this https URL
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2305.12786 [cs.CL]
	(or arXiv:2305.12786v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2305.12786

Submission history

From: Wen Lai
[v1] Mon, 22 May 2023 07:31:08 UTC (435 KB)
[v2] Tue, 24 Oct 2023 20:10:04 UTC (445 KB)
