-
ParaCotta: Synthetic Multilingual Paraphrase Corpora from the Most Diverse Translation Sample Pair
Abstract: We release our synthetic parallel paraphrase corpus across 17 languages: Arabic, Catalan, Czech, German, English, Spanish, Estonian, French, Hindi, Indonesian, Italian, Dutch, Romanian, Russian, Swedish, Vietnamese, and Chinese. Our method relies only on monolingual data and a neural machine translation system to generate paraphrases, hence simple to apply. We generate multiple translation samples… ▽ More
Submitted 9 May, 2022; originally announced May 2022.
Comments: 10 pages, 3 figures, 6 tables. Accepted at PACLIC 2021. (ACL Anthology link: https://aclanthology.org/2021.paclic-1.56/)
MSC Class: 68T50 ACM Class: I.2.7; I.2.6