Skip to main content

Showing 1–11 of 11 results for author: Manohar, V

Searching in archive eess. Search in all archives.
.
  1. arXiv:2406.02560  [pdf, other

    eess.AS cs.AI cs.CL cs.LG

    Less Peaky and More Accurate CTC Forced Alignment by Label Priors

    Authors: Ruizhe Huang, Xiaohui Zhang, Zhaoheng Ni, Li Sun, Moto Hira, Jeff Hwang, Vimal Manohar, Vineel Pratap, Matthew Wiesner, Shinji Watanabe, Daniel Povey, Sanjeev Khudanpur

    Abstract: Connectionist temporal classification (CTC) models are known to have peaky output distributions. Such behavior is not a problem for automatic speech recognition (ASR), but it can cause inaccurate forced alignments (FA), especially at finer granularity, e.g., phoneme level. This paper aims at alleviating the peaky behavior for CTC and improve its suitability for forced alignment generation, by leve… ▽ More

    Submitted 15 June, 2024; v1 submitted 22 April, 2024; originally announced June 2024.

    Comments: Accepted by ICASSP 2024. Github repo: https://github.com/huangruizhe/audio/tree/aligner_label_priors

  2. Acoustic modeling for Overlap** Speech Recognition: JHU Chime-5 Challenge System

    Authors: Vimal Manohar, Szu-Jui Chen, Zhiqi Wang, Yusuke Fujita, Shinji Watanabe, Sanjeev Khudanpur

    Abstract: This paper summarizes our acoustic modeling efforts in the Johns Hopkins University speech recognition system for the CHiME-5 challenge to recognize highly-overlapped dinner party speech recorded by multiple microphone arrays. We explore data augmentation approaches, neural network architectures, front-end speech dereverberation, beamforming and robust i-vector extraction with comparisons of our i… ▽ More

    Submitted 17 May, 2024; originally announced May 2024.

    Comments: Published in: ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

    Journal ref: ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 2019, pp. 6665-6669

  3. arXiv:2306.15687  [pdf, other

    eess.AS cs.CL cs.LG cs.SD

    Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale

    Authors: Matthew Le, Apoorv Vyas, Bowen Shi, Brian Karrer, Leda Sari, Rashel Moritz, Mary Williamson, Vimal Manohar, Yossi Adi, Jay Mahadeokar, Wei-Ning Hsu

    Abstract: Large-scale generative models such as GPT and DALL-E have revolutionized the research community. These models not only generate high fidelity outputs, but are also generalists which can solve tasks not explicitly taught. In contrast, speech generative models are still primitive in terms of scale and task generalization. In this paper, we present Voicebox, the most versatile text-guided generative… ▽ More

    Submitted 19 October, 2023; v1 submitted 23 June, 2023; originally announced June 2023.

    Comments: Accepted to NeurIPS 2023

  4. arXiv:2303.12197  [pdf, other

    eess.AS

    Self-Supervised Representations for Singing Voice Conversion

    Authors: Tejas Jayashankar, Jilong Wu, Leda Sari, David Kant, Vimal Manohar, Qing He

    Abstract: A singing voice conversion model converts a song in the voice of an arbitrary source singer to the voice of a target singer. Recently, methods that leverage self-supervised audio representations such as HuBERT and Wav2Vec 2.0 have helped further the state-of-the-art. Though these methods produce more natural and melodic singing outputs, they often rely on confusion and disentanglement losses to re… ▽ More

    Submitted 21 March, 2023; originally announced March 2023.

  5. arXiv:2211.13282  [pdf, other

    cs.SD cs.AI eess.AS

    Voice-preserving Zero-shot Multiple Accent Conversion

    Authors: Mumin **, Prashant Serai, Jilong Wu, Andros Tjandra, Vimal Manohar, Qing He

    Abstract: Most people who have tried to learn a foreign language would have experienced difficulties understanding or speaking with a native speaker's accent. For native speakers, understanding or speaking a new accent is likewise a difficult task. An accent conversion system that changes a speaker's accent but preserves that speaker's voice identity, such as timbre and pitch, has the potential for a range… ▽ More

    Submitted 14 October, 2023; v1 submitted 23 November, 2022; originally announced November 2022.

    Comments: Accepted to IEEE ICASSP 2023

  6. arXiv:2210.16045  [pdf, other

    cs.SD cs.CL eess.AS

    Towards zero-shot Text-based voice editing using acoustic context conditioning, utterance embeddings, and reference encoders

    Authors: Jason Fong, Yun Wang, Prabhav Agrawal, Vimal Manohar, Jilong Wu, Thilo Köhler, Qing He

    Abstract: Text-based voice editing (TBVE) uses synthetic output from text-to-speech (TTS) systems to replace words in an original recording. Recent work has used neural models to produce edited speech that is similar to the original speech in terms of clarity, speaker identity, and prosody. However, one limitation of prior work is the usage of finetuning to optimise performance: this requires further model… ▽ More

    Submitted 28 October, 2022; originally announced October 2022.

    Comments: Submitted to ICASSP 2023

  7. arXiv:2110.03520  [pdf, other

    eess.AS

    Accent-Robust Automatic Speech Recognition Using Supervised and Unsupervised Wav2vec Embeddings

    Authors: Jialu Li, Vimal Manohar, Pooja Chitkara, Andros Tjandra, Michael Picheny, Frank Zhang, Xiaohui Zhang, Yatharth Saraf

    Abstract: Speech recognition models often obtain degraded performance when tested on speech with unseen accents. Domain-adversarial training (DAT) and multi-task learning (MTL) are two common approaches for building accent-robust ASR models. ASR models using accent embeddings is another approach for improving robustness to accents. In this study, we perform systematic comparisons of DAT and MTL approaches u… ▽ More

    Submitted 8 October, 2021; v1 submitted 7 October, 2021; originally announced October 2021.

    Comments: Submitted to ICASSP 2022

  8. arXiv:2107.04154  [pdf, other

    eess.AS cs.LG

    On lattice-free boosted MMI training of HMM and CTC-based full-context ASR models

    Authors: Xiaohui Zhang, Vimal Manohar, David Zhang, Frank Zhang, Yangyang Shi, Nayan Singhal, Julian Chan, Fuchun Peng, Yatharth Saraf, Mike Seltzer

    Abstract: Hybrid automatic speech recognition (ASR) models are typically sequentially trained with CTC or LF-MMI criteria. However, they have vastly different legacies and are usually implemented in different frameworks. In this paper, by decoupling the concepts of modeling units and label topologies and building proper numerator/denominator graphs accordingly, we establish a generalized framework for hybri… ▽ More

    Submitted 26 September, 2021; v1 submitted 8 July, 2021; originally announced July 2021.

    Comments: accepted by ASRU 2021

  9. arXiv:2106.07759  [pdf, ps, other

    eess.AS cs.CL

    Kaizen: Continuously improving teacher using Exponential Moving Average for semi-supervised speech recognition

    Authors: Vimal Manohar, Tatiana Likhomanenko, Qiantong Xu, Wei-Ning Hsu, Ronan Collobert, Yatharth Saraf, Geoffrey Zweig, Abdelrahman Mohamed

    Abstract: In this paper, we introduce the Kaizen framework that uses a continuously improving teacher to generate pseudo-labels for semi-supervised speech recognition (ASR). The proposed approach uses a teacher model which is updated as the exponential moving average (EMA) of the student model parameters. We demonstrate that it is critical for EMA to be accumulated with full-precision floating point. The Ka… ▽ More

    Submitted 27 October, 2021; v1 submitted 14 June, 2021; originally announced June 2021.

    Comments: Updated with camera ready version

  10. arXiv:2005.07850  [pdf, ps, other

    eess.AS cs.CL cs.SD

    Large scale weakly and semi-supervised learning for low-resource video ASR

    Authors: Kritika Singh, Vimal Manohar, Alex Xiao, Sergey Edunov, Ross Girshick, Vitaliy Liptchinsky, Christian Fuegen, Yatharth Saraf, Geoffrey Zweig, Abdelrahman Mohamed

    Abstract: Many semi- and weakly-supervised approaches have been investigated for overcoming the labeling cost of building high quality speech recognition systems. On the challenging task of transcribing social media videos in low-resource conditions, we conduct a large scale systematic comparison between two self-labeling methods on one hand, and weakly-supervised pretraining using contextual metadata on th… ▽ More

    Submitted 6 August, 2020; v1 submitted 15 May, 2020; originally announced May 2020.

  11. arXiv:2004.09249  [pdf, other

    cs.SD cs.CL eess.AS

    CHiME-6 Challenge:Tackling Multispeaker Speech Recognition for Unsegmented Recordings

    Authors: Shinji Watanabe, Michael Mandel, Jon Barker, Emmanuel Vincent, Ashish Arora, Xuankai Chang, Sanjeev Khudanpur, Vimal Manohar, Daniel Povey, Desh Raj, David Snyder, Aswin Shanmugam Subramanian, Jan Trmal, Bar Ben Yair, Christoph Boeddeker, Zhaoheng Ni, Yusuke Fujita, Shota Horiguchi, Naoyuki Kanda, Takuya Yoshioka, Neville Ryant

    Abstract: Following the success of the 1st, 2nd, 3rd, 4th and 5th CHiME challenges we organize the 6th CHiME Speech Separation and Recognition Challenge (CHiME-6). The new challenge revisits the previous CHiME-5 challenge and further considers the problem of distant multi-microphone conversational speech diarization and recognition in everyday home environments. Speech material is the same as the previous C… ▽ More

    Submitted 2 May, 2020; v1 submitted 20 April, 2020; originally announced April 2020.