Showing 1–5 of 5 results for author: Padfield, D

Searching in archive eess.
  1. arXiv:2405.18669  [pdf, other]

    cs.LG cs.AI cs.CL eess.AS

    Zipper: A Multi-Tower Decoder Architecture for Fusing Modalities

    Authors: Vicky Zayats, Peter Chen, Melissa Ferrari, Dirk Padfield

    Abstract: Integrating multiple generative foundation models, especially those trained on different modalities, into something greater than the sum of its parts poses significant challenges. Two key hurdles are the availability of aligned data (concepts that contain similar meaning but are expressed differently in different modalities), and effectively leveraging unimodal representations in cross-domain gener…

    Submitted 31 May, 2024; v1 submitted 28 May, 2024; originally announced May 2024.

    Comments: Under review at NeurIPS

  2. arXiv:2306.12925  [pdf, other]

    cs.CL cs.AI cs.SD eess.AS stat.ML

    AudioPaLM: A Large Language Model That Can Speak and Listen

    Authors: Paul K. Rubenstein, Chulayuth Asawaroengchai, Duc Dung Nguyen, Ankur Bapna, Zalán Borsos, Félix de Chaumont Quitry, Peter Chen, Dalia El Badawy, Wei Han, Eugene Kharitonov, Hannah Muckenhirn, Dirk Padfield, James Qin, Danny Rozenberg, Tara Sainath, Johan Schalkwyk, Matt Sharifi, Michelle Tadmor Ramanovich, Marco Tagliasacchi, Alexandru Tudor, Mihajlo Velimirović, Damien Vincent, Jiahui Yu, Yongqiang Wang, Vicky Zayats, et al. (5 additional authors not shown)

    Abstract: We introduce AudioPaLM, a large language model for speech understanding and generation. AudioPaLM fuses text-based and speech-based language models, PaLM-2 [Anil et al., 2023] and AudioLM [Borsos et al., 2022], into a unified multimodal architecture that can process and generate text and speech with applications including speech recognition and speech-to-speech translation. AudioPaLM inherits the…

    Submitted 22 June, 2023; originally announced June 2023.

    Comments: Technical report

  3. arXiv:2208.03393  [pdf, other]

    cs.SD cs.CL cs.LG eess.AS

    Chronological Self-Training for Real-Time Speaker Diarization

    Authors: Dirk Padfield, Daniel J. Liebling

    Abstract: Diarization partitions an audio stream into segments based on the voices of the speakers. Real-time diarization systems that include an enrollment step should limit enrollment training samples to reduce user interaction time. Although training on a small number of samples yields poor performance, we show that the accuracy can be improved dramatically using a chronological self-training approach. W…

    Submitted 5 August, 2022; originally announced August 2022.

    Comments: 5 pages, 5 figures, ICASSP 2021

    Journal ref: Proc. Interspeech (2021) 4613-4617

  4. arXiv:2109.06952  [pdf, other]

    cs.CL cs.SD eess.AS

    Residual Adapters for Parameter-Efficient ASR Adaptation to Atypical and Accented Speech

    Authors: Katrin Tomanek, Vicky Zayats, Dirk Padfield, Kara Vaillancourt, Fadi Biadsy

    Abstract: Automatic Speech Recognition (ASR) systems are often optimized to work best for speakers with canonical speech patterns. Unfortunately, these systems perform poorly when tested on atypical speech and heavily accented speech. It has previously been shown that personalization through model fine-tuning substantially improves performance. However, maintaining such large models per speaker is costly an…

    Submitted 14 September, 2021; originally announced September 2021.

    Comments: Accepted to EMNLP 2021

  5. arXiv:2010.11132  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    Sentence Boundary Augmentation For Neural Machine Translation Robustness

    Authors: Daniel Li, Te I, Naveen Arivazhagan, Colin Cherry, Dirk Padfield

    Abstract: Neural Machine Translation (NMT) models have demonstrated strong state-of-the-art performance on translation tasks where well-formed training and evaluation data are provided, but they remain sensitive to inputs that include errors of various types. Specifically, in the context of long-form speech translation systems, where the input transcripts come from Automatic Speech Recognition (ASR), the NM…

    Submitted 21 October, 2020; originally announced October 2020.

    Comments: 5 pages, 4 figures