Skip to main content

Showing 1–3 of 3 results for author: Baljekar, P

Searching in archive cs. Search in all archives.
.
  1. arXiv:2205.03983  [pdf, other

    cs.CL cs.AI cs.LG

    Building Machine Translation Systems for the Next Thousand Languages

    Authors: Ankur Bapna, Isaac Caswell, Julia Kreutzer, Orhan Firat, Daan van Esch, Aditya Siddhant, Mengmeng Niu, Pallavi Baljekar, Xavier Garcia, Wolfgang Macherey, Theresa Breiner, Vera Axelrod, Jason Riesa, Yuan Cao, Mia Xu Chen, Klaus Macherey, Maxim Krikun, Pidong Wang, Alexander Gutkin, Apurva Shah, Yan** Huang, Zhifeng Chen, Yonghui Wu, Macduff Hughes

    Abstract: In this paper we share findings from our effort to build practical machine translation (MT) systems capable of translating across over one thousand languages. We describe results in three research domains: (i) Building clean, web-mined datasets for 1500+ languages by leveraging semi-supervised pre-training for language identification and develo** data-driven filtering techniques; (ii) Develo**… ▽ More

    Submitted 6 July, 2022; v1 submitted 8 May, 2022; originally announced May 2022.

    Comments: V2: updated with some details from 24-language Google Translate launch in May 2022 V3: spelling corrections, additional acknowledgements

  2. Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets

    Authors: Julia Kreutzer, Isaac Caswell, Lisa Wang, Ahsan Wahab, Daan van Esch, Nasanbayar Ulzii-Orshikh, Allahsera Tapo, Nishant Subramani, Artem Sokolov, Claytone Sikasote, Monang Setyawan, Supheakmungkol Sarin, Sokhar Samb, Benoît Sagot, Clara Rivera, Annette Rios, Isabel Papadimitriou, Salomey Osei, Pedro Ortiz Suarez, Iroro Orife, Kelechi Ogueji, Andre Niyongabo Rubungo, Toan Q. Nguyen, Mathias Müller, André Müller , et al. (27 additional authors not shown)

    Abstract: With the success of large-scale pre-training and multilingual modeling in Natural Language Processing (NLP), recent years have seen a proliferation of large, web-mined text datasets covering hundreds of languages. We manually audit the quality of 205 language-specific corpora released with five major public datasets (CCAligned, ParaCrawl, WikiMatrix, OSCAR, mC4). Lower-resource corpora have system… ▽ More

    Submitted 21 February, 2022; v1 submitted 22 March, 2021; originally announced March 2021.

    Comments: Accepted at TACL; pre-MIT Press publication version

    Journal ref: Transactions of the Association for Computational Linguistics (2022) 10: 50-72

  3. arXiv:1912.03393  [pdf, other

    cs.CL cs.AI cs.LG

    Re-Translation Strategies For Long Form, Simultaneous, Spoken Language Translation

    Authors: Naveen Arivazhagan, Colin Cherry, Te I, Wolfgang Macherey, Pallavi Baljekar, George Foster

    Abstract: We investigate the problem of simultaneous machine translation of long-form speech content. We target a continuous speech-to-text scenario, generating translated captions for a live audio feed, such as a lecture or play-by-play commentary. As this scenario allows for revisions to our incremental translations, we adopt a re-translation approach to simultaneous translation, where the source is repea… ▽ More

    Submitted 7 April, 2020; v1 submitted 6 December, 2019; originally announced December 2019.

    Comments: ICASSP 2020