Bilex Rx: Lexical Data Augmentation for Massively Multilingual Machine Translation

Jones, Alex; Caswell, Isaac; Saxena, Ishank; Firat, Orhan

arXiv:2303.15265 (cs)

[Submitted on 27 Mar 2023]

Abstract: Neural machine translation (NMT) has progressed rapidly over the past several years, and modern models are able to achieve relatively high quality using only monolingual text data, an approach dubbed Unsupervised Machine Translation (UNMT). However, these models still struggle in a variety of ways, including aspects of translation that are the easiest for a human, such as correctly translating common nouns. This work explores a cheap and abundant resource to combat this problem: bilingual lexica. We test the efficacy of bilingual lexica in a real-world set-up, on 200-language translation models trained on web-crawled text. We present several findings: (1) using lexical data augmentation, we demonstrate sizable performance gains for unsupervised translation; (2) we compare several families of data augmentation, demonstrating that they yield similar improvements, and can be combined for even greater improvements; (3) we demonstrate the importance of carefully curated lexica over larger, noisier ones, especially with larger models; and (4) we compare the efficacy of multilingual lexicon data versus human-translated parallel data. Finally, we open-source GATITOS (available at this https URL), a new multilingual lexicon for 26 low-resource languages, which had the highest performance among lexica in our experiments.

Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
ACM classes:	I.2.7
Cite as:	arXiv:2303.15265 [cs.CL]
	(or arXiv:2303.15265v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2303.15265

Submission history

From: Alexander Jones
[v1] Mon, 27 Mar 2023 14:54:43 UTC (11,111 KB)
