Skip to main content

Showing 1–20 of 20 results for author: Saraf, Y

Searching in archive cs. Search in all archives.
.
  1. arXiv:2212.12048  [pdf, other

    cs.CL cs.SD eess.AS

    Pushing the performances of ASR models on English and Spanish accents

    Authors: Pooja Chitkara, Morgane Riviere, Jade Copet, Frank Zhang, Yatharth Saraf

    Abstract: Speech to text models tend to be trained and evaluated against a single target accent. This is especially true for English for which native speakers from the United States became the main benchmark. In this work, we are going to show how two simple methods: pre-trained embeddings and auxiliary classification losses can improve the performance of ASR systems. We are looking for upgrades as universa… ▽ More

    Submitted 22 December, 2022; originally announced December 2022.

  2. arXiv:2207.09674  [pdf, other

    cs.CL cs.LG cs.SD eess.AS

    Improving Data Driven Inverse Text Normalization using Data Augmentation

    Authors: Laxmi Pandey, Debjyoti Paul, Pooja Chitkara, Yutong Pang, Xuedong Zhang, Kjell Schubert, Mark Chou, Shu Liu, Yatharth Saraf

    Abstract: Inverse text normalization (ITN) is used to convert the spoken form output of an automatic speech recognition (ASR) system to a written form. Traditional handcrafted ITN rules can be complex to transcribe and maintain. Meanwhile neural modeling approaches require quality large-scale spoken-written pair examples in the same or similar domain as the ASR system (in-domain data), to train. Both these… ▽ More

    Submitted 20 July, 2022; originally announced July 2022.

  3. arXiv:2111.09983  [pdf, other

    eess.AS cs.SD

    Towards Measuring Fairness in Speech Recognition: Casual Conversations Dataset Transcriptions

    Authors: Chunxi Liu, Michael Picheny, Leda Sarı, Pooja Chitkara, Alex Xiao, Xiaohui Zhang, Mark Chou, Andres Alvarado, Caner Hazirbas, Yatharth Saraf

    Abstract: It is well known that many machine learning systems demonstrate bias towards specific groups of individuals. This problem has been studied extensively in the Facial Recognition area, but much less so in Automatic Speech Recognition (ASR). This paper presents initial Speech Recognition results on "Casual Conversations" -- a publicly released 846 hour corpus designed to help researchers evaluate the… ▽ More

    Submitted 18 November, 2021; originally announced November 2021.

    Comments: Submitted to ICASSP 2022. Our dataset will be publicly available at (https://ai.facebook.com/datasets/casual-conversations-downloads) for general use. We also would like to note that considering the limitations of our dataset, we limit the use of it for only evaluation purposes (see license agreement)

  4. arXiv:2111.09296  [pdf, other

    cs.CL cs.SD eess.AS

    XLS-R: Self-supervised Cross-lingual Speech Representation Learning at Scale

    Authors: Arun Babu, Changhan Wang, Andros Tjandra, Kushal Lakhotia, Qiantong Xu, Naman Goyal, Kritika Singh, Patrick von Platen, Yatharth Saraf, Juan Pino, Alexei Baevski, Alexis Conneau, Michael Auli

    Abstract: This paper presents XLS-R, a large-scale model for cross-lingual speech representation learning based on wav2vec 2.0. We train models with up to 2B parameters on nearly half a million hours of publicly available speech audio in 128 languages, an order of magnitude more public data than the largest known prior work. Our evaluation covers a wide range of tasks, domains, data regimes and languages, b… ▽ More

    Submitted 16 December, 2021; v1 submitted 17 November, 2021; originally announced November 2021.

  5. arXiv:2111.05948  [pdf, other

    cs.CL cs.SD eess.AS

    Scaling ASR Improves Zero and Few Shot Learning

    Authors: Alex Xiao, Weiyi Zheng, Gil Keren, Duc Le, Frank Zhang, Christian Fuegen, Ozlem Kalinli, Yatharth Saraf, Abdelrahman Mohamed

    Abstract: With 4.5 million hours of English speech from 10 different sources across 120 countries and models of up to 10 billion parameters, we explore the frontiers of scale for automatic speech recognition. We propose data selection techniques to efficiently scale training data to find the most valuable samples in massive datasets. To efficiently scale model sizes, we leverage various optimizations such a… ▽ More

    Submitted 29 November, 2021; v1 submitted 10 November, 2021; originally announced November 2021.

  6. arXiv:2110.07313  [pdf, other

    cs.SD cs.LG eess.AS

    Conformer-Based Self-Supervised Learning for Non-Speech Audio Tasks

    Authors: Sangeeta Srivastava, Yun Wang, Andros Tjandra, Anurag Kumar, Chunxi Liu, Kritika Singh, Yatharth Saraf

    Abstract: Representation learning from unlabeled data has been of major interest in artificial intelligence research. While self-supervised speech representation learning has been popular in the speech research community, very few works have comprehensively analyzed audio representation learning for non-speech audio tasks. In this paper, we propose a self-supervised audio representation learning method and… ▽ More

    Submitted 6 January, 2022; v1 submitted 14 October, 2021; originally announced October 2021.

    Comments: 4 pages. Submitted to ICASSP in Oct 2021

  7. arXiv:2107.04154  [pdf, other

    eess.AS cs.LG

    On lattice-free boosted MMI training of HMM and CTC-based full-context ASR models

    Authors: Xiaohui Zhang, Vimal Manohar, David Zhang, Frank Zhang, Yangyang Shi, Nayan Singhal, Julian Chan, Fuchun Peng, Yatharth Saraf, Mike Seltzer

    Abstract: Hybrid automatic speech recognition (ASR) models are typically sequentially trained with CTC or LF-MMI criteria. However, they have vastly different legacies and are usually implemented in different frameworks. In this paper, by decoupling the concepts of modeling units and label topologies and building proper numerator/denominator graphs accordingly, we establish a generalized framework for hybri… ▽ More

    Submitted 26 September, 2021; v1 submitted 8 July, 2021; originally announced July 2021.

    Comments: accepted by ASRU 2021

  8. arXiv:2107.04082  [pdf, other

    cs.CL cs.SD eess.AS

    Improved Language Identification Through Cross-Lingual Self-Supervised Learning

    Authors: Andros Tjandra, Diptanu Gon Choudhury, Frank Zhang, Kritika Singh, Alexis Conneau, Alexei Baevski, Assaf Sela, Yatharth Saraf, Michael Auli

    Abstract: Language identification greatly impacts the success of downstream tasks such as automatic speech recognition. Recently, self-supervised speech representations learned by wav2vec 2.0 have been shown to be very effective for a range of speech tasks. We extend previous self-supervised work on language identification by experimenting with pre-trained models which were learned on real-world unconstrain… ▽ More

    Submitted 17 October, 2021; v1 submitted 8 July, 2021; originally announced July 2021.

  9. arXiv:2106.07759  [pdf, ps, other

    eess.AS cs.CL

    Kaizen: Continuously improving teacher using Exponential Moving Average for semi-supervised speech recognition

    Authors: Vimal Manohar, Tatiana Likhomanenko, Qiantong Xu, Wei-Ning Hsu, Ronan Collobert, Yatharth Saraf, Geoffrey Zweig, Abdelrahman Mohamed

    Abstract: In this paper, we introduce the Kaizen framework that uses a continuously improving teacher to generate pseudo-labels for semi-supervised speech recognition (ASR). The proposed approach uses a teacher model which is updated as the exponential moving average (EMA) of the student model parameters. We demonstrate that it is critical for EMA to be accumulated with full-precision floating point. The Ka… ▽ More

    Submitted 27 October, 2021; v1 submitted 14 June, 2021; originally announced June 2021.

    Comments: Updated with camera ready version

  10. arXiv:2104.02194  [pdf, other

    cs.CL cs.LG eess.AS

    Contextualized Streaming End-to-End Speech Recognition with Trie-Based Deep Biasing and Shallow Fusion

    Authors: Duc Le, Mahaveer Jain, Gil Keren, Suyoun Kim, Yangyang Shi, Jay Mahadeokar, Julian Chan, Yuan Shangguan, Christian Fuegen, Ozlem Kalinli, Yatharth Saraf, Michael L. Seltzer

    Abstract: How to leverage dynamic contextual information in end-to-end speech recognition has remained an active research area. Previous solutions to this problem were either designed for specialized use cases that did not generalize well to open-domain scenarios, did not scale to large biasing lists, or underperformed on rare long-tail words. We address these limitations by proposing a novel solution that… ▽ More

    Submitted 11 June, 2021; v1 submitted 5 April, 2021; originally announced April 2021.

    Comments: Accepted for presentation at INTERSPEECH 2021

  11. arXiv:2102.06291  [pdf, other

    cs.SD cs.LG eess.AS eess.IV

    A Multi-View Approach To Audio-Visual Speaker Verification

    Authors: Leda Sarı, Kritika Singh, Jiatong Zhou, Lorenzo Torresani, Nayan Singhal, Yatharth Saraf

    Abstract: Although speaker verification has conventionally been an audio-only task, some practical applications provide both audio and visual streams of input. In these cases, the visual stream provides complementary information and can often be leveraged in conjunction with the acoustics of speech to improve verification performance. In this study, we explore audio-visual approaches to speaker verification… ▽ More

    Submitted 11 February, 2021; originally announced February 2021.

  12. arXiv:2011.04785  [pdf, ps, other

    eess.AS cs.SD

    Benchmarking LF-MMI, CTC and RNN-T Criteria for Streaming ASR

    Authors: Xiaohui Zhang, Frank Zhang, Chunxi Liu, Kjell Schubert, Julian Chan, Pradyot Prakash, Jun Liu, Ching-Feng Yeh, Fuchun Peng, Yatharth Saraf, Geoffrey Zweig

    Abstract: In this work, to measure the accuracy and efficiency for a latency-controlled streaming automatic speech recognition (ASR) application, we perform comprehensive evaluations on three popular training criteria: LF-MMI, CTC and RNN-T. In transcribing social media videos of 7 languages with training data 3K-14K hours, we conduct large-scale controlled experimentation across each criterion using identi… ▽ More

    Submitted 9 November, 2020; originally announced November 2020.

    Comments: Accepted for publication at IEEE Spoken Language Technology Workshop (SLT), 2021

  13. arXiv:2011.03840  [pdf, other

    cs.SD eess.AS

    Dual Application of Speech Enhancement for Automatic Speech Recognition

    Authors: Ashutosh Pandey, Chunxi Liu, Yun Wang, Yatharth Saraf

    Abstract: In this work, we exploit speech enhancement for improving a recurrent neural network transducer (RNN-T) based ASR system. We employ a dense convolutional recurrent network (DCRN) for complex spectral map** based speech enhancement, and find it helpful for ASR in two ways: a data augmentation technique, and a preprocessing frontend. In using it for ASR data augmentation, we exploit a KL divergenc… ▽ More

    Submitted 7 November, 2020; originally announced November 2020.

    Comments: Accepted for publication in SLT 2021

  14. arXiv:2011.03109  [pdf, other

    cs.CL cs.SD eess.AS

    Improving RNN Transducer Based ASR with Auxiliary Tasks

    Authors: Chunxi Liu, Frank Zhang, Duc Le, Suyoun Kim, Yatharth Saraf, Geoffrey Zweig

    Abstract: End-to-end automatic speech recognition (ASR) models with a single neural network have recently demonstrated state-of-the-art results compared to conventional hybrid speech recognizers. Specifically, recurrent neural network transducer (RNN-T) has shown competitive ASR performance on various benchmarks. In this work, we examine ways in which RNN-T can achieve better ASR accuracy via performing aux… ▽ More

    Submitted 8 November, 2020; v1 submitted 5 November, 2020; originally announced November 2020.

    Comments: Accepted for publication at IEEE Spoken Language Technology Workshop (SLT), 2021

  15. arXiv:2006.03411  [pdf, other

    eess.AS cs.CL cs.LG cs.SD

    Contextual RNN-T For Open Domain ASR

    Authors: Mahaveer Jain, Gil Keren, Jay Mahadeokar, Geoffrey Zweig, Florian Metze, Yatharth Saraf

    Abstract: End-to-end (E2E) systems for automatic speech recognition (ASR), such as RNN Transducer (RNN-T) and Listen-Attend-Spell (LAS) blend the individual components of a traditional hybrid ASR system - acoustic model, language model, pronunciation model - into a single neural network. While this has some nice advantages, it limits the system to be trained using only paired audio and text. Because of this… ▽ More

    Submitted 12 August, 2020; v1 submitted 4 June, 2020; originally announced June 2020.

  16. arXiv:2005.09150  [pdf, other

    eess.AS cs.CL

    Faster, Simpler and More Accurate Hybrid ASR Systems Using Wordpieces

    Authors: Frank Zhang, Yongqiang Wang, Xiaohui Zhang, Chunxi Liu, Yatharth Saraf, Geoffrey Zweig

    Abstract: In this work, we first show that on the widely used LibriSpeech benchmark, our transformer-based context-dependent connectionist temporal classification (CTC) system produces state-of-the-art results. We then show that using wordpieces as modeling units combined with CTC training, we can greatly simplify the engineering pipeline compared to conventional frame-based cross-entropy training by exclud… ▽ More

    Submitted 16 August, 2020; v1 submitted 18 May, 2020; originally announced May 2020.

    Comments: In proceedings Interspeech 2020

  17. arXiv:2005.07850  [pdf, ps, other

    eess.AS cs.CL cs.SD

    Large scale weakly and semi-supervised learning for low-resource video ASR

    Authors: Kritika Singh, Vimal Manohar, Alex Xiao, Sergey Edunov, Ross Girshick, Vitaliy Liptchinsky, Christian Fuegen, Yatharth Saraf, Geoffrey Zweig, Abdelrahman Mohamed

    Abstract: Many semi- and weakly-supervised approaches have been investigated for overcoming the labeling cost of building high quality speech recognition systems. On the challenging task of transcribing social media videos in low-resource conditions, we conduct a large scale systematic comparison between two self-labeling methods on one hand, and weakly-supervised pretraining using contextual metadata on th… ▽ More

    Submitted 6 August, 2020; v1 submitted 15 May, 2020; originally announced May 2020.

  18. arXiv:2005.07394  [pdf, other

    cs.CL cs.SD eess.AS

    Contextualizing ASR Lattice Rescoring with Hybrid Pointer Network Language Model

    Authors: Da-Rong Liu, Chunxi Liu, Frank Zhang, Gabriel Synnaeve, Yatharth Saraf, Geoffrey Zweig

    Abstract: Videos uploaded on social media are often accompanied with textual descriptions. In building automatic speech recognition (ASR) systems for videos, we can exploit the contextual information provided by such video metadata. In this paper, we explore ASR lattice rescoring by selectively attending to the video descriptions. We first use an attention based method to extract contextual vector represent… ▽ More

    Submitted 15 May, 2020; originally announced May 2020.

  19. arXiv:1910.12367  [pdf, other

    cs.CL cs.LG cs.SD eess.AS

    Training ASR models by Generation of Contextual Information

    Authors: Kritika Singh, Dmytro Okhonko, Jun Liu, Yongqiang Wang, Frank Zhang, Ross Girshick, Sergey Edunov, Fuchun Peng, Yatharth Saraf, Geoffrey Zweig, Abdelrahman Mohamed

    Abstract: Supervised ASR models have reached unprecedented levels of accuracy, thanks in part to ever-increasing amounts of labelled training data. However, in many applications and locales, only moderate amounts of data are available, which has led to a surge in semi- and weakly-supervised learning research. In this paper, we conduct a large-scale study evaluating the effectiveness of weakly-supervised lea… ▽ More

    Submitted 14 February, 2020; v1 submitted 27 October, 2019; originally announced October 2019.

  20. arXiv:1909.06522  [pdf, ps, other

    eess.AS cs.CL cs.LG cs.SD

    Multilingual Graphemic Hybrid ASR with Massive Data Augmentation

    Authors: Chunxi Liu, Qiaochu Zhang, Xiaohui Zhang, Kritika Singh, Yatharth Saraf, Geoffrey Zweig

    Abstract: Towards develo** high-performing ASR for low-resource languages, approaches to address the lack of resources are to make use of data from multiple languages, and to augment the training data by creating acoustic variations. In this work we present a single grapheme-based ASR model learned on 7 geographically proximal languages, using standard hybrid BLSTM-HMM acoustic models with lattice-free MM… ▽ More

    Submitted 8 April, 2020; v1 submitted 13 September, 2019; originally announced September 2019.

    Comments: Accepted for publication at the 1st Joint Workshop of SLTU (Spoken Language Technologies for Under-resourced languages) and CCURL (Collaboration and Computing for Under-Resourced Languages) (SLTU-CCURL 2020)