Showing 1–7 of 7 results for author: Breiner, T

  1. arXiv:2207.00706  [pdf, other]

    eess.AS cs.CL cs.LG

    UserLibri: A Dataset for ASR Personalization Using Only Text

    Authors: Theresa Breiner, Swaroop Ramaswamy, Ehsan Variani, Shefali Garg, Rajiv Mathews, Khe Chai Sim, Kilol Gupta, Mingqing Chen, Lara McConnaughey

    Abstract: Personalization of speech models on mobile devices (on-device personalization) is an active area of research, but more often than not, mobile devices have more text-only data than paired audio-text data. We explore training a personalized language model on text-only data, used during inference to improve speech recognition performance for that user. We experiment on a user-clustered LibriSpeech co…

    Submitted 1 July, 2022; originally announced July 2022.

    Comments: Accepted for publication in Interspeech 2022. 9 total pages with appendix, 9 total tables, 5 total figures

  2. arXiv:2205.03983  [pdf, other]

    cs.CL cs.AI cs.LG

    Building Machine Translation Systems for the Next Thousand Languages

    Authors: Ankur Bapna, Isaac Caswell, Julia Kreutzer, Orhan Firat, Daan van Esch, Aditya Siddhant, Mengmeng Niu, Pallavi Baljekar, Xavier Garcia, Wolfgang Macherey, Theresa Breiner, Vera Axelrod, Jason Riesa, Yuan Cao, Mia Xu Chen, Klaus Macherey, Maxim Krikun, Pidong Wang, Alexander Gutkin, Apurva Shah, Yanping Huang, Zhifeng Chen, Yonghui Wu, Macduff Hughes

    Abstract: In this paper we share findings from our effort to build practical machine translation (MT) systems capable of translating across over one thousand languages. We describe results in three research domains: (i) Building clean, web-mined datasets for 1500+ languages by leveraging semi-supervised pre-training for language identification and developing data-driven filtering techniques; (ii) Developing…

    Submitted 6 July, 2022; v1 submitted 8 May, 2022; originally announced May 2022.

    Comments: V2: updated with some details from 24-language Google Translate launch in May 2022. V3: spelling corrections, additional acknowledgements

  3. arXiv:2204.09715  [pdf, other]

    cs.CL cs.LG

    Scaling Language Model Size in Cross-Device Federated Learning

    Authors: Jae Hun Ro, Theresa Breiner, Lara McConnaughey, Mingqing Chen, Ananda Theertha Suresh, Shankar Kumar, Rajiv Mathews

    Abstract: Most studies in cross-device federated learning focus on small models, due to the server-client communication and on-device computation bottlenecks. In this work, we leverage various techniques for mitigating these bottlenecks to train larger language models in cross-device federated learning. With systematic applications of partial model training, quantization, efficient transfer learning, and co…

    Submitted 24 June, 2022; v1 submitted 31 March, 2022; originally announced April 2022.

  4. arXiv:2101.11575  [pdf, other]

    cs.CL

    Mining Large-Scale Low-Resource Pronunciation Data From Wikipedia

    Authors: Tania Chakraborty, Manasa Prasad, Theresa Breiner, Sandy Ritchie, Daan van Esch

    Abstract: Pronunciation modeling is a key task for building speech technology in new languages, and while solid grapheme-to-phoneme (G2P) mapping systems exist, language coverage can stand to be improved. The information needed to build G2P models for many more languages can easily be found on Wikipedia, but unfortunately, it is stored in disparate formats. We report on a system we built to mine a pronuncia…

    Submitted 27 January, 2021; originally announced January 2021.

    Comments: 7 pages, 9 figures

  5. arXiv:2010.14571  [pdf, other]

    cs.CL cs.LG

    Language ID in the Wild: Unexpected Challenges on the Path to a Thousand-Language Web Text Corpus

    Authors: Isaac Caswell, Theresa Breiner, Daan van Esch, Ankur Bapna

    Abstract: Large text corpora are increasingly important for a wide variety of Natural Language Processing (NLP) tasks, and automatic language identification (LangID) is a core technology needed to collect such datasets in a multilingual context. LangID is largely treated as solved in the literature, with models reported that achieve over 90% average F1 on as many as 1,366 languages. We train LangID models o…

    Submitted 29 October, 2020; v1 submitted 27 October, 2020; originally announced October 2020.

    Comments: Accepted to COLING 2020. 9 pages with 8 page abstract

  6. arXiv:1912.01218  [pdf]

    cs.HC cs.CL

    Writing Across the World's Languages: Deep Internationalization for Gboard, the Google Keyboard

    Authors: Daan van Esch, Elnaz Sarbar, Tamar Lucassen, Jeremy O'Brien, Theresa Breiner, Manasa Prasad, Evan Crew, Chieu Nguyen, Françoise Beaufays

    Abstract: This technical report describes our deep internationalization program for Gboard, the Google Keyboard. Today, Gboard supports 900+ language varieties across 70+ writing systems, and this report describes how and why we have been adding support for hundreds of language varieties from around the globe. Many languages of the world are increasingly used in writing on an everyday basis, and we describe…

    Submitted 3 December, 2019; originally announced December 2019.

  7. arXiv:1901.06039  [pdf, other]

    cs.CL cs.CY

    Automatic Keyboard Layout Design for Low-Resource Latin-Script Languages

    Authors: Theresa Breiner, Chieu Nguyen, Daan van Esch, Jeremy O'Brien

    Abstract: We present our approach to automatically designing and implementing keyboard layouts on mobile devices for typing low-resource languages written in the Latin script. For many speakers, one of the barriers in accessing and creating text content on the web is the absence of input tools for their language. Ease in typing in these languages would lower technological barriers to online communication an…

    Submitted 17 January, 2019; originally announced January 2019.

    Comments: 4 pages, 8 figures