Skip to main content

Showing 1–27 of 27 results for author: Rosenberg, A

Searching in archive eess. Search in all archives.
.
  1. arXiv:2406.14701  [pdf, other

    cs.AI cs.CL cs.SD eess.AS

    Speech Prefix-Tuning with RNNT Loss for Improving LLM Predictions

    Authors: Murali Karthick Baskar, Andrew Rosenberg, Bhuvana Ramabhadran, Neeraj Gaur, Zhong Meng

    Abstract: In this paper, we focus on addressing the constraints faced when applying LLMs to ASR. Recent works utilize prefixLM-type models, which directly apply speech as a prefix to LLMs for ASR. We have found that optimizing speech prefixes leads to better ASR performance and propose applying RNNT loss to perform speech prefix-tuning. This is a simple approach and does not increase the model complexity or… ▽ More

    Submitted 20 June, 2024; originally announced June 2024.

  2. arXiv:2406.06664  [pdf, other

    eess.AS cs.LG cs.SD

    ASTRA: Aligning Speech and Text Representations for Asr without Sampling

    Authors: Neeraj Gaur, Rohan Agrawal, Gary Wang, Parisa Haghani, Andrew Rosenberg, Bhuvana Ramabhadran

    Abstract: This paper introduces ASTRA, a novel method for improving Automatic Speech Recognition (ASR) through text injection.Unlike prevailing techniques, ASTRA eliminates the need for sampling to match sequence lengths between speech and text modalities. Instead, it leverages the inherent alignments learned within CTC/RNNT models. This approach offers the following two advantages, namely, avoiding potenti… ▽ More

    Submitted 13 June, 2024; v1 submitted 10 June, 2024; originally announced June 2024.

    Comments: To be published in Interspeech 2024

  3. arXiv:2402.18932  [pdf, other

    eess.AS cs.SD

    Extending Multilingual Speech Synthesis to 100+ Languages without Transcribed Data

    Authors: Takaaki Saeki, Gary Wang, Nobuyuki Morioka, Isaac Elias, Kyle Kastner, Andrew Rosenberg, Bhuvana Ramabhadran, Heiga Zen, Françoise Beaufays, Hadar Shemtov

    Abstract: Collecting high-quality studio recordings of audio is challenging, which limits the language coverage of text-to-speech (TTS) systems. This paper proposes a framework for scaling a multilingual TTS model to 100+ languages using found data without supervision. The proposed framework combines speech-text encoder pretraining with unsupervised training using untranscribed speech and unspoken text data… ▽ More

    Submitted 29 February, 2024; originally announced February 2024.

    Comments: To appear in ICASSP 2024

  4. arXiv:2402.01115  [pdf, other

    cs.CL eess.SP

    Interpretation of Intracardiac Electrograms Through Textual Representations

    Authors: William Jongwon Han, Diana Gomez, Avi Alok, Chao**g Duan, Michael A. Rosenberg, Douglas Weber, Emerson Liu, Ding Zhao

    Abstract: Understanding the irregular electrical activity of atrial fibrillation (AFib) has been a key challenge in electrocardiography. For serious cases of AFib, catheter ablations are performed to collect intracardiac electrograms (EGMs). EGMs offer intricately detailed and localized electrical activity of the heart and are an ideal modality for interpretable cardiac studies. Recent advancements in artif… ▽ More

    Submitted 11 April, 2024; v1 submitted 1 February, 2024; originally announced February 2024.

    Comments: 18 pages, 9 figures; Accepted to CHIL 2024

    ACM Class: I.2.7; J.3

  5. arXiv:2401.04235  [pdf, other

    cs.CL cs.SD eess.AS

    High-precision Voice Search Query Correction via Retrievable Speech-text Embedings

    Authors: Christopher Li, Gary Wang, Kyle Kastner, Heng Su, Allen Chen, Andrew Rosenberg, Zhehuai Chen, Zelin Wu, Leonid Velikovich, Pat Rondon, Diamantino Caseiro, Petar Aleksic

    Abstract: Automatic speech recognition (ASR) systems can suffer from poor recall for various reasons, such as noisy audio, lack of sufficient training data, etc. Previous work has shown that recall can be improved by retrieving rewrite candidates from a large database of likely, contextually-relevant alternatives to the hypothesis text using nearest-neighbors search over embeddings of the ASR hypothesis t… ▽ More

    Submitted 8 January, 2024; originally announced January 2024.

  6. arXiv:2308.07486  [pdf, other

    cs.LG cs.CL cs.SD eess.AS

    O-1: Self-training with Oracle and 1-best Hypothesis

    Authors: Murali Karthick Baskar, Andrew Rosenberg, Bhuvana Ramabhadran, Kartik Audhkhasi

    Abstract: We introduce O-1, a new self-training objective to reduce training bias and unify training and evaluation metrics for speech recognition. O-1 is a faster variant of Expected Minimum Bayes Risk (EMBR), that boosts the oracle hypothesis and can accommodate both supervised and unsupervised data. We demonstrate the effectiveness of our approach in terms of recognition on publicly available SpeechStew… ▽ More

    Submitted 14 August, 2023; originally announced August 2023.

  7. arXiv:2308.07393  [pdf, ps, other

    cs.CL cs.SD eess.AS

    Using Text Injection to Improve Recognition of Personal Identifiers in Speech

    Authors: Yochai Blau, Rohan Agrawal, Lior Madmony, Gary Wang, Andrew Rosenberg, Zhehuai Chen, Zorik Gekhman, Genady Beryozkin, Parisa Haghani, Bhuvana Ramabhadran

    Abstract: Accurate recognition of specific categories, such as persons' names, dates or other identifiers is critical in many Automatic Speech Recognition (ASR) applications. As these categories represent personal information, ethical use of this data including collection, transcription, training and evaluation demands special care. One way of ensuring the security and privacy of individuals is to redact or… ▽ More

    Submitted 14 August, 2023; originally announced August 2023.

    Comments: Accepted to Interspeech 2023

    MSC Class: 68T10 ACM Class: I.2.7

  8. arXiv:2308.06125  [pdf, other

    cs.CL cs.AI cs.SD eess.AS

    Improving Joint Speech-Text Representations Without Alignment

    Authors: Cal Peyser, Zhong Meng, Ke Hu, Rohit Prabhavalkar, Andrew Rosenberg, Tara N. Sainath, Michael Picheny, Kyunghyun Cho

    Abstract: The last year has seen astonishing progress in text-prompted image generation premised on the idea of a cross-modal representation space in which the text and image domains are represented jointly. In ASR, this idea has found application as joint speech-text encoders that can scale to the capacities of very large parameter models by being trained on both unpaired speech and text. While these metho… ▽ More

    Submitted 11 August, 2023; originally announced August 2023.

    Journal ref: INTERSPEECH 2023

  9. arXiv:2304.14514  [pdf, other

    cs.CL cs.LG cs.SD eess.AS

    Understanding Shared Speech-Text Representations

    Authors: Gary Wang, Kyle Kastner, Ankur Bapna, Zhehuai Chen, Andrew Rosenberg, Bhuvana Ramabhadran, Yu Zhang

    Abstract: Recently, a number of approaches to train speech models by incorpo-rating text into end-to-end models have been developed, with Mae-stro advancing state-of-the-art automatic speech recognition (ASR)and Speech Translation (ST) performance. In this paper, we expandour understanding of the resulting shared speech-text representationswith two types of analyses. First we examine the limits of speech-fr… ▽ More

    Submitted 27 April, 2023; originally announced April 2023.

    Comments: Accepted at ICASSP 2023, camera ready

  10. arXiv:2303.01037  [pdf, other

    cs.CL cs.SD eess.AS

    Google USM: Scaling Automatic Speech Recognition Beyond 100 Languages

    Authors: Yu Zhang, Wei Han, James Qin, Yongqiang Wang, Ankur Bapna, Zhehuai Chen, Nanxin Chen, Bo Li, Vera Axelrod, Gary Wang, Zhong Meng, Ke Hu, Andrew Rosenberg, Rohit Prabhavalkar, Daniel S. Park, Parisa Haghani, Jason Riesa, Ginger Perng, Hagen Soltau, Trevor Strohman, Bhuvana Ramabhadran, Tara Sainath, Pedro Moreno, Chung-Cheng Chiu, Johan Schalkwyk , et al. (2 additional authors not shown)

    Abstract: We introduce the Universal Speech Model (USM), a single large model that performs automatic speech recognition (ASR) across 100+ languages. This is achieved by pre-training the encoder of the model on a large unlabeled multilingual dataset of 12 million (M) hours spanning over 300 languages, and fine-tuning on a smaller labeled dataset. We use multilingual pre-training with random-projection quant… ▽ More

    Submitted 24 September, 2023; v1 submitted 2 March, 2023; originally announced March 2023.

    Comments: 20 pages, 7 figures, 8 tables

  11. arXiv:2302.08583  [pdf, other

    eess.AS cs.AI cs.CL cs.LG cs.SD

    JEIT: Joint End-to-End Model and Internal Language Model Training for Speech Recognition

    Authors: Zhong Meng, Weiran Wang, Rohit Prabhavalkar, Tara N. Sainath, Tongzhou Chen, Ehsan Variani, Yu Zhang, Bo Li, Andrew Rosenberg, Bhuvana Ramabhadran

    Abstract: We propose JEIT, a joint end-to-end (E2E) model and internal language model (ILM) training method to inject large-scale unpaired text into ILM during E2E training which improves rare-word speech recognition. With JEIT, the E2E model computes an E2E loss on audio-transcript pairs while its ILM estimates a cross-entropy loss on unpaired text. The E2E model is trained to minimize a weighted sum of E2… ▽ More

    Submitted 16 February, 2023; originally announced February 2023.

    Comments: 5 pages, 3 figures, in ICASSP 2023

    Journal ref: 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes island, Greece

  12. arXiv:2210.15447  [pdf, other

    cs.SD cs.CL eess.AS

    Virtuoso: Massive Multilingual Speech-Text Joint Semi-Supervised Learning for Text-To-Speech

    Authors: Takaaki Saeki, Heiga Zen, Zhehuai Chen, Nobuyuki Morioka, Gary Wang, Yu Zhang, Ankur Bapna, Andrew Rosenberg, Bhuvana Ramabhadran

    Abstract: This paper proposes Virtuoso, a massively multilingual speech-text joint semi-supervised learning framework for text-to-speech synthesis (TTS) models. Existing multilingual TTS typically supports tens of languages, which are a small fraction of the thousands of languages in the world. One difficulty to scale multilingual TTS to hundreds of languages is collecting high-quality speech-text paired da… ▽ More

    Submitted 15 March, 2023; v1 submitted 27 October, 2022; originally announced October 2022.

    Comments: To appear in ICASSP 2023

  13. arXiv:2210.10879  [pdf, other

    cs.LG cs.CL cs.SD eess.AS

    G-Augment: Searching for the Meta-Structure of Data Augmentation Policies for ASR

    Authors: Gary Wang, Ekin D. Cubuk, Andrew Rosenberg, Shuyang Cheng, Ron J. Weiss, Bhuvana Ramabhadran, Pedro J. Moreno, Quoc V. Le, Daniel S. Park

    Abstract: Data augmentation is a ubiquitous technique used to provide robustness to automatic speech recognition (ASR) training. However, even as so much of the ASR training process has become automated and more "end-to-end", the data augmentation policy (what augmentation functions to use, and how to apply them) remains hand-crafted. We present Graph-Augment, a technique to define the augmentation space as… ▽ More

    Submitted 24 October, 2022; v1 submitted 19 October, 2022; originally announced October 2022.

    Comments: 6 pages, accepted at SLT 2022. Updated with copyright

  14. arXiv:2210.10027  [pdf, other

    cs.CL cs.SD eess.AS

    Maestro-U: Leveraging joint speech-text representation learning for zero supervised speech ASR

    Authors: Zhehuai Chen, Ankur Bapna, Andrew Rosenberg, Yu Zhang, Bhuvana Ramabhadran, Pedro Moreno, Nanxin Chen

    Abstract: Training state-of-the-art Automated Speech Recognition (ASR) models typically requires a substantial amount of transcribed speech. In this work, we demonstrate that a modality-matched joint speech and text model can be leveraged to train a massively multilingual ASR model without any supervised (manually transcribed) speech for some languages. This paper explores the use of jointly learnt speech a… ▽ More

    Submitted 21 October, 2022; v1 submitted 18 October, 2022; originally announced October 2022.

    Comments: Accepted by SLT 2022

    MSC Class: 68T10 ACM Class: I.2.7

  15. arXiv:2209.06987  [pdf, other

    cs.SD cs.LG eess.AS

    Non-Parallel Voice Conversion for ASR Augmentation

    Authors: Gary Wang, Andrew Rosenberg, Bhuvana Ramabhadran, Fadi Biadsy, Yinghui Huang, Jesse Emond, Pedro Moreno Mengibar

    Abstract: Automatic speech recognition (ASR) needs to be robust to speaker differences. Voice Conversion (VC) modifies speaker characteristics of input speech. This is an attractive feature for ASR data augmentation. In this paper, we demonstrate that voice conversion can be used as a data augmentation technique to improve ASR performance, even on LibriSpeech, which contains 2,456 speakers. For ASR augmenta… ▽ More

    Submitted 14 September, 2022; originally announced September 2022.

    Comments: Accepted by Interspeech 2022

    MSC Class: 68T10 ACM Class: I.2.7

  16. arXiv:2208.01220  [pdf, other

    stat.ML cs.LG eess.SP

    GeoECG: Data Augmentation via Wasserstein Geodesic Perturbation for Robust Electrocardiogram Prediction

    Authors: Jiacheng Zhu, Jielin Qiu, Zhuolin Yang, Douglas Weber, Michael A. Rosenberg, Emerson Liu, Bo Li, Ding Zhao

    Abstract: There has been an increased interest in applying deep neural networks to automatically interpret and analyze the 12-lead electrocardiogram (ECG). The current paradigms with machine learning methods are often limited by the amount of labeled data. This phenomenon is particularly problematic for clinically-relevant data, where labeling at scale can be time-consuming and costly in terms of the specia… ▽ More

    Submitted 10 August, 2022; v1 submitted 1 August, 2022; originally announced August 2022.

    Comments: 26 pages, Figure 13, Machine Learning for Healthcare 2022

    Journal ref: Machine Learning for Healthcare 2022, JMLR Volume 182

  17. arXiv:2205.08014  [pdf, ps, other

    eess.AS cs.SD

    Accented Speech Recognition: Benchmarking, Pre-training, and Diverse Data

    Authors: Alëna Aksënova, Zhehuai Chen, Chung-Cheng Chiu, Daan van Esch, Pavel Golik, Wei Han, Levi King, Bhuvana Ramabhadran, Andrew Rosenberg, Suzan Schwartz, Gary Wang

    Abstract: Building inclusive speech recognition systems is a crucial step towards develo** technologies that speakers of all language varieties can use. Therefore, ASR systems must work for everybody independently of the way they speak. To accomplish this goal, there should be available data sets representing language varieties, and also an understanding of model configuration that is the most helpful in… ▽ More

    Submitted 16 May, 2022; originally announced May 2022.

    Comments: 5 pages, 3 tables

  18. arXiv:2204.03409  [pdf, other

    cs.CL cs.SD eess.AS

    MAESTRO: Matched Speech Text Representations through Modality Matching

    Authors: Zhehuai Chen, Yu Zhang, Andrew Rosenberg, Bhuvana Ramabhadran, Pedro Moreno, Ankur Bapna, Heiga Zen

    Abstract: We present Maestro, a self-supervised training method to unify representations learnt from speech and text modalities. Self-supervised learning from speech signals aims to learn the latent structure inherent in the signal, while self-supervised learning from text attempts to capture lexical information. Learning aligned representations from unpaired speech and text sequences is a challenging task.… ▽ More

    Submitted 1 July, 2022; v1 submitted 7 April, 2022; originally announced April 2022.

    Comments: Accepted by Interspeech 2022

    MSC Class: 68T10 ACM Class: I.2.7

  19. A Scalable Model Specialization Framework for Training and Inference using Submodels and its Application to Speech Model Personalization

    Authors: Fadi Biadsy, Youzheng Chen, Xia Zhang, Oleg Rybakov, Andrew Rosenberg, Pedro J. Moreno

    Abstract: Model fine-tuning and adaptation have become a common approach for model specialization for downstream tasks or domains. Fine-tuning the entire model or a subset of the parameters using light-weight adaptation has shown considerable success across different specialization tasks. Fine-tuning a model for a large number of domains typically requires starting a new training job for every domain posing… ▽ More

    Submitted 13 September, 2022; v1 submitted 23 March, 2022; originally announced March 2022.

    Comments: Submitted to INTERSPEECH

  20. arXiv:2202.12719  [pdf, other

    cs.SD cs.CL eess.AS

    Ask2Mask: Guided Data Selection for Masked Speech Modeling

    Authors: Murali Karthick Baskar, Andrew Rosenberg, Bhuvana Ramabhadran, Yu Zhang, Pedro Moreno

    Abstract: Masked speech modeling (MSM) methods such as wav2vec2 or w2v-BERT learn representations over speech frames which are randomly masked within an utterance. While these methods improve performance of Automatic Speech Recognition (ASR) systems, they have one major limitation. They treat all unsupervised speech samples with equal weight, which hinders learning as not all samples have relevant informati… ▽ More

    Submitted 24 February, 2022; originally announced February 2022.

  21. arXiv:2108.12226  [pdf, other

    cs.CL cs.SD eess.AS

    Injecting Text in Self-Supervised Speech Pretraining

    Authors: Zhehuai Chen, Yu Zhang, Andrew Rosenberg, Bhuvana Ramabhadran, Gary Wang, Pedro Moreno

    Abstract: Self-supervised pretraining for Automated Speech Recognition (ASR) has shown varied degrees of success. In this paper, we propose to jointly learn representations during pretraining from two different modalities: speech and text. The proposed method, tts4pretrain complements the power of contrastive learning in self-supervision with linguistic/lexical representations derived from synthesized speec… ▽ More

    Submitted 27 August, 2021; originally announced August 2021.

    Comments: submit to ASRU 2021

    MSC Class: 68T10 ACM Class: I.2.7

  22. arXiv:2102.11433  [pdf

    eess.IV cs.AI

    Histo-fetch -- On-the-fly processing of gigapixel whole slide images simplifies and speeds neural network training

    Authors: Brendon Lutnick, Leema Krishna Murali, Brandon Ginley, Avi Z. Rosenberg, Pinaki Sarder

    Abstract: We created a custom pipeline (histo-fetch) to efficiently extract random patches and labels from pathology whole slide images (WSIs) for input to a neural network on-the-fly. We prefetch these patches as needed during network training, avoiding the need for WSI preparation such as chop**/tiling. We demonstrate the utility of this pipeline to perform artificial stain transfer and image generation… ▽ More

    Submitted 1 March, 2021; v1 submitted 22 February, 2021; originally announced February 2021.

    Comments: 5 pages, 2 figures

  23. arXiv:2002.12868  [pdf

    q-bio.TO cs.CV eess.IV

    Neural Network Segmentation of Interstitial Fibrosis, Tubular Atrophy, and Glomerulosclerosis in Renal Biopsies

    Authors: Brandon Ginley, Kuang-Yu Jen, Avi Rosenberg, Felicia Yen, Sanjay Jain, Agnes Fogo, Pinaki Sarder

    Abstract: Glomerulosclerosis, interstitial fibrosis, and tubular atrophy (IFTA) are histologic indicators of irrecoverable kidney injury. In standard clinical practice, the renal pathologist visually assesses, under the microscope, the percentage of sclerotic glomeruli and the percentage of renal cortical involvement by IFTA. Estimation of IFTA is a subjective process due to a varied spectrum and definition… ▽ More

    Submitted 28 February, 2020; originally announced February 2020.

  24. arXiv:2002.03788  [pdf, other

    eess.AS cs.LG cs.SD stat.ML

    Generating diverse and natural text-to-speech samples using a quantized fine-grained VAE and auto-regressive prosody prior

    Authors: Guangzhi Sun, Yu Zhang, Ron J. Weiss, Yuan Cao, Heiga Zen, Andrew Rosenberg, Bhuvana Ramabhadran, Yonghui Wu

    Abstract: Recent neural text-to-speech (TTS) models with fine-grained latent features enable precise control of the prosody of synthesized speech. Such models typically incorporate a fine-grained variational autoencoder (VAE) structure, extracting latent features at each input token (e.g., phonemes). However, generating samples with the standard VAE prior often results in unnatural and discontinuous speech,… ▽ More

    Submitted 6 February, 2020; originally announced February 2020.

    Comments: To appear in ICASSP 2020

  25. arXiv:1909.11699  [pdf, other

    cs.CL cs.SD eess.AS

    Speech Recognition with Augmented Synthesized Speech

    Authors: Andrew Rosenberg, Yu Zhang, Bhuvana Ramabhadran, Ye Jia, Pedro Moreno, Yonghui Wu, Zelin Wu

    Abstract: Recent success of the Tacotron speech synthesis architecture and its variants in producing natural sounding multi-speaker synthesized speech has raised the exciting possibility of replacing expensive, manually transcribed, domain-specific, human speech that is used to train speech recognizers. The multi-speaker speech synthesis architecture can learn latent embedding spaces of prosody, speaker and… ▽ More

    Submitted 25 September, 2019; originally announced September 2019.

    Comments: Accepted for publication at ASRU 2020

  26. arXiv:1907.04448  [pdf, other

    cs.CL cs.SD eess.AS

    Learning to Speak Fluently in a Foreign Language: Multilingual Speech Synthesis and Cross-Language Voice Cloning

    Authors: Yu Zhang, Ron J. Weiss, Heiga Zen, Yonghui Wu, Zhifeng Chen, RJ Skerry-Ryan, Ye Jia, Andrew Rosenberg, Bhuvana Ramabhadran

    Abstract: We present a multispeaker, multilingual text-to-speech (TTS) synthesis model based on Tacotron that is able to produce high quality speech in multiple languages. Moreover, the model is able to transfer voices across languages, e.g. synthesize fluent Spanish speech using an English speaker's voice, without training on any bilingual or parallel examples. Such transfer works across distantly related… ▽ More

    Submitted 24 July, 2019; v1 submitted 9 July, 2019; originally announced July 2019.

    Comments: 5 pages, submitted to Interspeech 2019

  27. arXiv:1802.02656  [pdf, other

    cs.CL cs.SD eess.AS

    Joint Modeling of Accents and Acoustics for Multi-Accent Speech Recognition

    Authors: Xuesong Yang, Kartik Audhkhasi, Andrew Rosenberg, Samuel Thomas, Bhuvana Ramabhadran, Mark Hasegawa-Johnson

    Abstract: The performance of automatic speech recognition systems degrades with increasing mismatch between the training and testing scenarios. Differences in speaker accents are a significant source of such mismatch. The traditional approach to deal with multiple accents involves pooling data from several accents during training and building a single model in multi-task fashion, where tasks correspond to i… ▽ More

    Submitted 7 February, 2018; originally announced February 2018.

    Comments: Accepted in The 43rd IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP2018)