Showing 1–42 of 42 results for author: Povey, D

  1. arXiv:2406.09589  [pdf, other]

    eess.AS

    Multi-Channel Multi-Speaker ASR Using Target Speaker's Solo Segment

    Authors: Yiwen Shao, Shi-Xiong Zhang, Yong Xu, Meng Yu, Dong Yu, Daniel Povey, Sanjeev Khudanpur

    Abstract: In the field of multi-channel, multi-speaker Automatic Speech Recognition (ASR), the task of discerning and accurately transcribing a target speaker's speech within background noise remains a formidable challenge. Traditional approaches often rely on microphone array configurations and the information of the target speaker's location or voiceprint. This study introduces the Solo Spatial Feature (S…

    Submitted 17 June, 2024; v1 submitted 13 June, 2024; originally announced June 2024.

    Comments: Accepted for presentation at Interspeech 2024

  2. arXiv:2406.06571  [pdf, other]

    cs.CL cs.AI

    SUBLLM: A Novel Efficient Architecture with Token Sequence Subsampling for LLM

    Authors: Quandong Wang, Yuxuan Yuan, Xiaoyu Yang, Ruike Zhang, Kang Zhao, Wei Liu, Jian Luan, Daniel Povey, Bin Wang

    Abstract: While Large Language Models (LLMs) have achieved remarkable success in various fields, the efficiency of training and inference remains a major challenge. To address this issue, we propose SUBLLM, short for Subsampling-Upsampling-Bypass Large Language Model, an innovative architecture that extends the core decoder-only framework by incorporating subsampling, upsampling, and bypass modules. The sub…

    Submitted 17 June, 2024; v1 submitted 3 June, 2024; originally announced June 2024.

    Comments: 9 pages, 3 figures, submitted to ECAI 2024

    ACM Class: I.2.7

  3. arXiv:2406.02560  [pdf, other]

    eess.AS cs.AI cs.CL cs.LG

    Less Peaky and More Accurate CTC Forced Alignment by Label Priors

    Authors: Ruizhe Huang, Xiaohui Zhang, Zhaoheng Ni, Li Sun, Moto Hira, Jeff Hwang, Vimal Manohar, Vineel Pratap, Matthew Wiesner, Shinji Watanabe, Daniel Povey, Sanjeev Khudanpur

    Abstract: Connectionist temporal classification (CTC) models are known to have peaky output distributions. Such behavior is not a problem for automatic speech recognition (ASR), but it can cause inaccurate forced alignments (FA), especially at finer granularity, e.g., phoneme level. This paper aims at alleviating the peaky behavior for CTC and improve its suitability for forced alignment generation, by leve…

    Submitted 15 June, 2024; v1 submitted 22 April, 2024; originally announced June 2024.

    Comments: Accepted by ICASSP 2024. Github repo: https://github.com/huangruizhe/audio/tree/aligner_label_priors

  4. arXiv:2401.15676  [pdf, other]

    eess.AS cs.SD

    On Speaker Attribution with SURT

    Authors: Desh Raj, Matthew Wiesner, Matthew Maciejewski, Leibny Paola Garcia-Perera, Daniel Povey, Sanjeev Khudanpur

    Abstract: The Streaming Unmixing and Recognition Transducer (SURT) has recently become a popular framework for continuous, streaming, multi-talker speech recognition (ASR). With advances in architecture, objectives, and mixture simulation methods, it was demonstrated that SURT can be an efficient streaming method for speaker-agnostic transcription of real meetings. In this work, we push this framework furth…

    Submitted 28 January, 2024; originally announced January 2024.

    Comments: 8 pages, 6 figures, 6 tables. Submitted to Odyssey 2024

  5. arXiv:2310.11230  [pdf, other]

    eess.AS cs.LG cs.SD

    Zipformer: A faster and better encoder for automatic speech recognition

    Authors: Zengwei Yao, Liyong Guo, Xiaoyu Yang, Wei Kang, Fangjun Kuang, Yifan Yang, Zengrui Jin, Long Lin, Daniel Povey

    Abstract: The Conformer has become the most popular encoder model for automatic speech recognition (ASR). It adds convolution modules to a transformer to learn both local and global dependencies. In this work we describe a faster, more memory-efficient, and better-performing transformer, called Zipformer. Modeling changes include: 1) a U-Net-like encoder structure where middle stacks operate at lower frame…

    Submitted 9 April, 2024; v1 submitted 17 October, 2023; originally announced October 2023.

    Comments: Published as a conference paper at ICLR 2024

  6. arXiv:2309.15796  [pdf, other]

    eess.AS cs.CL cs.LG

    Learning from Flawed Data: Weakly Supervised Automatic Speech Recognition

    Authors: Dongji Gao, Hainan Xu, Desh Raj, Leibny Paola Garcia Perera, Daniel Povey, Sanjeev Khudanpur

    Abstract: Training automatic speech recognition (ASR) systems requires large amounts of well-curated paired data. However, human annotators usually perform "non-verbatim" transcription, which can result in poorly trained models. In this paper, we propose Omni-temporal Classification (OTC), a novel training criterion that explicitly incorporates label uncertainties originating from such weak supervision. Thi…

    Submitted 26 September, 2023; originally announced September 2023.

  7. arXiv:2309.08105  [pdf, other]

    eess.AS cs.SD

    Libriheavy: a 50,000 hours ASR corpus with punctuation casing and context

    Authors: Wei Kang, Xiaoyu Yang, Zengwei Yao, Fangjun Kuang, Yifan Yang, Liyong Guo, Long Lin, Daniel Povey

    Abstract: In this paper, we introduce Libriheavy, a large-scale ASR corpus consisting of 50,000 hours of read English speech derived from LibriVox. To the best of our knowledge, Libriheavy is the largest freely-available corpus of speech with supervisions. Different from other open-sourced datasets that only provide normalized transcriptions, Libriheavy contains richer information such as punctuation, casin…

    Submitted 14 January, 2024; v1 submitted 14 September, 2023; originally announced September 2023.

    Comments: Submitted to ICASSP 2024

  8. arXiv:2309.07414  [pdf, other]

    eess.AS cs.CL cs.SD

    PromptASR for contextualized ASR with controllable style

    Authors: Xiaoyu Yang, Wei Kang, Zengwei Yao, Yifan Yang, Liyong Guo, Fangjun Kuang, Long Lin, Daniel Povey

    Abstract: Prompts are crucial to large language models as they provide context information such as topic or logical relationships. Inspired by this, we propose PromptASR, a framework that integrates prompts in end-to-end automatic speech recognition (E2E ASR) systems to achieve contextualized ASR with controllable style of transcriptions. Specifically, a dedicated text encoder encodes the text prompts and t…

    Submitted 24 January, 2024; v1 submitted 13 September, 2023; originally announced September 2023.

    Comments: Proc. ICASSP 2024

  9. arXiv:2309.07377  [pdf, other]

    eess.AS cs.SD

    Towards Universal Speech Discrete Tokens: A Case Study for ASR and TTS

    Authors: Yifan Yang, Feiyu Shen, Chenpeng Du, Ziyang Ma, Kai Yu, Daniel Povey, Xie Chen

    Abstract: Self-supervised learning (SSL) proficiency in speech-related tasks has driven research into utilizing discrete tokens for speech tasks like recognition and translation, which offer lower storage requirements and great potential to employ natural language processing techniques. However, these studies, mainly single-task focused, faced challenges like overfitting and performance degradation in speec…

    Submitted 14 December, 2023; v1 submitted 13 September, 2023; originally announced September 2023.

    Comments: Accepted in ICASSP 2024

  10. arXiv:2308.06547  [pdf, other]

    eess.AS cs.CL cs.SD

    Alternative Pseudo-Labeling for Semi-Supervised Automatic Speech Recognition

    Authors: Han Zhu, Dongji Gao, Gaofeng Cheng, Daniel Povey, Pengyuan Zhang, Yonghong Yan

    Abstract: When labeled data is insufficient, semi-supervised learning with the pseudo-labeling technique can significantly improve the performance of automatic speech recognition. However, pseudo-labels are often noisy, containing numerous incorrect tokens. Taking noisy labels as ground-truth in the loss function results in suboptimal performance. Previous works attempted to mitigate this issue by either fi…

    Submitted 12 August, 2023; originally announced August 2023.

    Comments: Accepted by IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), 2023

  11. arXiv:2306.10559  [pdf, other]

    eess.AS cs.SD

    SURT 2.0: Advances in Transducer-based Multi-talker Speech Recognition

    Authors: Desh Raj, Daniel Povey, Sanjeev Khudanpur

    Abstract: The Streaming Unmixing and Recognition Transducer (SURT) model was proposed recently as an end-to-end approach for continuous, streaming, multi-talker speech recognition (ASR). Despite impressive results on multi-turn meetings, SURT has notable limitations: (i) it suffers from leakage and omission related errors; (ii) it is computationally expensive, due to which it has not seen adoption in academ…

    Submitted 19 September, 2023; v1 submitted 18 June, 2023; originally announced June 2023.

    Comments: 13 pages, 7 figures. To appear in IEEE TASLP. Project webpage: https://sites.google.com/view/surt2

  12. arXiv:2306.01031  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    Bypass Temporal Classification: Weakly Supervised Automatic Speech Recognition with Imperfect Transcripts

    Authors: Dongji Gao, Matthew Wiesner, Hainan Xu, Leibny Paola Garcia, Daniel Povey, Sanjeev Khudanpur

    Abstract: This paper presents a novel algorithm for building an automatic speech recognition (ASR) model with imperfect training data. Imperfectly transcribed speech is a prevalent issue in human-annotated speech corpora, which degrades the performance of ASR models. To address this problem, we propose Bypass Temporal Classification (BTC) as an expansion of the Connectionist Temporal Classification (CTC) cr…

    Submitted 1 June, 2023; originally announced June 2023.

  13. arXiv:2305.11558  [pdf, other]

    eess.AS cs.CL

    Blank-regularized CTC for Frame Skipping in Neural Transducer

    Authors: Yifan Yang, Xiaoyu Yang, Liyong Guo, Zengwei Yao, Wei Kang, Fangjun Kuang, Long Lin, Xie Chen, Daniel Povey

    Abstract: Neural Transducer and connectionist temporal classification (CTC) are popular end-to-end automatic speech recognition systems. Due to their frame-synchronous design, blank symbols are introduced to address the length mismatch between acoustic frames and output tokens, which might bring redundant computation. Previous studies managed to accelerate the training and inference of neural Transducers by…

    Submitted 19 May, 2023; originally announced May 2023.

    Comments: Accepted in INTERSPEECH 2023

  14. arXiv:2305.11539  [pdf, other]

    eess.AS

    Delay-penalized CTC implemented based on Finite State Transducer

    Authors: Zengwei Yao, Wei Kang, Fangjun Kuang, Liyong Guo, Xiaoyu Yang, Yifan Yang, Long Lin, Daniel Povey

    Abstract: Connectionist Temporal Classification (CTC) suffers from the latency problem when applied to streaming models. We argue that in CTC lattice, the alignments that can access more future context are preferred during training, thereby leading to higher symbol delay. In this work we propose the delay-penalized CTC which is augmented with latency penalty regularization. We devise a flexible and efficien…

    Submitted 19 May, 2023; originally announced May 2023.

    Comments: Accepted in INTERSPEECH 2023

  15. arXiv:2212.05271  [pdf, other]

    eess.AS cs.SD

    GPU-accelerated Guided Source Separation for Meeting Transcription

    Authors: Desh Raj, Daniel Povey, Sanjeev Khudanpur

    Abstract: Guided source separation (GSS) is a type of target-speaker extraction method that relies on pre-computed speaker activities and blind source separation to perform front-end enhancement of overlapped speech signals. It was first proposed during the CHiME-5 challenge and provided significant improvements over the delay-and-sum beamforming baseline. Despite its strengths, however, the method has seen…

    Submitted 13 August, 2023; v1 submitted 10 December, 2022; originally announced December 2022.

    Comments: 7 pages, 4 figures. To appear at InterSpeech 2023. Code available at https://github.com/desh2608/gss

  16. arXiv:2211.00508  [pdf, other]

    eess.AS cs.CL cs.SD

    Predicting Multi-Codebook Vector Quantization Indexes for Knowledge Distillation

    Authors: Liyong Guo, Xiaoyu Yang, Quandong Wang, Yuxiang Kong, Zengwei Yao, Fan Cui, Fangjun Kuang, Wei Kang, Long Lin, Mingshuang Luo, Piotr Zelasko, Daniel Povey

    Abstract: Knowledge distillation (KD) is a common approach to improve model performance in automatic speech recognition (ASR), where a student model is trained to imitate the output behaviour of a teacher model. However, traditional KD methods suffer from teacher label storage issue, especially when the training corpora are large. Although on-the-fly teacher label generation tackles this issue, the training…

    Submitted 31 October, 2022; originally announced November 2022.

    Comments: Submitted to ICASSP 2022

  17. arXiv:2211.00490  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    Delay-penalized transducer for low-latency streaming ASR

    Authors: Wei Kang, Zengwei Yao, Fangjun Kuang, Liyong Guo, Xiaoyu Yang, Long Lin, Piotr Żelasko, Daniel Povey

    Abstract: In streaming automatic speech recognition (ASR), it is desirable to reduce latency as much as possible while having minimum impact on recognition accuracy. Although a few existing methods are able to achieve this goal, they are difficult to implement due to their dependency on external alignments. In this paper, we propose a simple way to penalize symbol delay in transducer model, so that we can b…

    Submitted 31 October, 2022; originally announced November 2022.

    Comments: Submitted to 2023 IEEE International Conference on Acoustics, Speech and Signal Processing

  18. arXiv:2211.00484  [pdf, ps, other]

    eess.AS cs.CL cs.LG cs.SD

    Fast and parallel decoding for transducer

    Authors: Wei Kang, Liyong Guo, Fangjun Kuang, Long Lin, Mingshuang Luo, Zengwei Yao, Xiaoyu Yang, Piotr Żelasko, Daniel Povey

    Abstract: The transducer architecture is becoming increasingly popular in the field of speech recognition, because it is naturally streaming as well as high in accuracy. One of the drawbacks of transducer is that it is difficult to decode in a fast and parallel way due to an unconstrained number of symbols that can be emitted per time step. In this work, we introduce a constrained version of transducer loss…

    Submitted 31 October, 2022; originally announced November 2022.

    Comments: Submitted to 2023 IEEE International Conference on Acoustics, Speech and Signal Processing

  19. arXiv:2206.13236  [pdf, other]

    eess.AS cs.AI cs.LG

    Pruned RNN-T for fast, memory-efficient ASR training

    Authors: Fangjun Kuang, Liyong Guo, Wei Kang, Long Lin, Mingshuang Luo, Zengwei Yao, Daniel Povey

    Abstract: The RNN-Transducer (RNN-T) framework for speech recognition has been growing in popularity, particularly for deployed real-time ASR systems, because it combines high accuracy with naturally streaming recognition. One of the drawbacks of RNN-T is that its loss function is relatively slow to compute, and can use a lot of memory. Excessive GPU memory usage can make it impractical to use RNN-T loss in…

    Submitted 23 June, 2022; originally announced June 2022.

  20. arXiv:2110.12561  [pdf, other]

    cs.SD eess.AS

    Lhotse: a speech data representation library for the modern deep learning ecosystem

    Authors: Piotr Żelasko, Daniel Povey, Jan "Yenda" Trmal, Sanjeev Khudanpur

    Abstract: Speech data is notoriously difficult to work with due to a variety of codecs, lengths of recordings, and meta-data formats. We present Lhotse, a speech data representation library that draws upon lessons learned from Kaldi speech recognition toolkit and brings its concepts into the modern deep learning ecosystem. Lhotse provides a common JSON description format with corresponding Python classes an…

    Submitted 24 October, 2021; originally announced October 2021.

    Comments: Accepted for presentation at NeurIPS 2021 Data-Centric AI (DCAI) Workshop

  21. arXiv:2106.06909  [pdf, other]

    cs.SD cs.CL eess.AS

    GigaSpeech: An Evolving, Multi-domain ASR Corpus with 10,000 Hours of Transcribed Audio

    Authors: Guoguo Chen, Shuzhou Chai, Guanbo Wang, Jiayu Du, Wei-Qiang Zhang, Chao Weng, Dan Su, Daniel Povey, Jan Trmal, Junbo Zhang, Mingjie Jin, Sanjeev Khudanpur, Shinji Watanabe, Shuaijiang Zhao, Wei Zou, Xiangang Li, Xuchen Yao, Yongqing Wang, Yujun Wang, Zhao You, Zhiyong Yan

    Abstract: This paper introduces GigaSpeech, an evolving, multi-domain English speech recognition corpus with 10,000 hours of high quality labeled audio suitable for supervised training, and 40,000 hours of total audio suitable for semi-supervised and unsupervised training. Around 40,000 hours of transcribed audio is first collected from audiobooks, podcasts and YouTube, covering both read and spontaneous sp…

    Submitted 13 June, 2021; originally announced June 2021.

  22. arXiv:2104.01378  [pdf, other]

    cs.CL cs.SD eess.AS

    speechocean762: An Open-Source Non-native English Speech Corpus For Pronunciation Assessment

    Authors: Junbo Zhang, Zhiwen Zhang, Yongqing Wang, Zhiyong Yan, Qiong Song, Yukai Huang, Ke Li, Daniel Povey, Yujun Wang

    Abstract: This paper introduces a new open-source speech corpus named "speechocean762" designed for pronunciation assessment use, consisting of 5000 English utterances from 250 non-native speakers, where half of the speakers are children. Five experts annotated each of the utterances at sentence-level, word-level and phoneme-level. A baseline system is released in open source to illustrate the phoneme-level…

    Submitted 2 June, 2021; v1 submitted 3 April, 2021; originally announced April 2021.

    Comments: Accepted in INTERSPEECH 2021

  23. arXiv:2103.09063  [pdf, other]

    cs.SD eess.AS

    An Asynchronous WFST-Based Decoder For Automatic Speech Recognition

    Authors: Hang Lv, Zhehuai Chen, Hainan Xu, Daniel Povey, Lei Xie, Sanjeev Khudanpur

    Abstract: We introduce asynchronous dynamic decoder, which adopts an efficient A* algorithm to incorporate big language models in the one-pass decoding for large vocabulary continuous speech recognition. Unlike standard one-pass decoding with on-the-fly composition decoder which might induce a significant computation overhead, the asynchronous dynamic decoder has a novel design where it has two fronts, with…

    Submitted 16 March, 2021; originally announced March 2021.

    Comments: 5 pages, 5 figures, icassp

  24. arXiv:2103.05081  [pdf, other]

    eess.AS cs.CL cs.SD

    A Parallelizable Lattice Rescoring Strategy with Neural Language Models

    Authors: Ke Li, Daniel Povey, Sanjeev Khudanpur

    Abstract: This paper proposes a parallel computation strategy and a posterior-based lattice expansion algorithm for efficient lattice rescoring with neural language models (LMs) for automatic speech recognition. First, lattices from first-pass decoding are expanded by the proposed posterior-based lattice expansion algorithm. Second, each expanded lattice is converted into a minimal list of hypotheses that c…

    Submitted 8 March, 2021; originally announced March 2021.

    Comments: To appear at ICASSP 2021. 5 pages, 1 figure

  25. arXiv:2102.04488  [pdf, other]

    cs.CL cs.SD eess.AS

    Wake Word Detection with Streaming Transformers

    Authors: Yiming Wang, Hang Lv, Daniel Povey, Lei Xie, Sanjeev Khudanpur

    Abstract: Modern wake word detection systems usually rely on neural networks for acoustic modeling. Transformers has recently shown superior performance over LSTM and convolutional networks in various sequence modeling tasks with their better temporal modeling power. However it is not clear whether this advantage still holds for short-range temporal modeling like wake word detection. Besides, the vanilla Tr…

    Submitted 8 February, 2021; originally announced February 2021.

    Comments: Accepted at IEEE ICASSP 2021. 5 pages, 3 figures

  26. arXiv:2011.02090  [pdf, other]

    eess.AS cs.SD

    Frustratingly Easy Noise-aware Training of Acoustic Models

    Authors: Desh Raj, Jesus Villalba, Daniel Povey, Sanjeev Khudanpur

    Abstract: Environmental noises and reverberation have a detrimental effect on the performance of automatic speech recognition (ASR) systems. Multi-condition training of neural network-based acoustic models is used to deal with this problem, but it requires many-folds data augmentation, resulting in increased training time. In this paper, we propose utterance-level noise vectors for noise-aware training of a…

    Submitted 2 February, 2021; v1 submitted 3 November, 2020; originally announced November 2020.

    Comments: 6 + 3 (Appendix) pages

  27. arXiv:2011.01997  [pdf, other]

    eess.AS cs.SD

    DOVER-Lap: A Method for Combining Overlap-aware Diarization Outputs

    Authors: Desh Raj, Leibny Paola Garcia-Perera, Zili Huang, Shinji Watanabe, Daniel Povey, Andreas Stolcke, Sanjeev Khudanpur

    Abstract: Several advances have been made recently towards handling overlapping speech for speaker diarization. Since speech and natural language tasks often benefit from ensemble techniques, we propose an algorithm for combining outputs from such diarization systems through majority voting. Our method, DOVER-Lap, is inspired from the recently proposed DOVER algorithm, but is designed to handle overlapping…

    Submitted 3 November, 2020; originally announced November 2020.

    Comments: Accepted to IEEE SLT 2021

  28. arXiv:2009.13774  [pdf, other]

    eess.AS

    Neural Language Modeling With Implicit Cache Pointers

    Authors: Ke Li, Daniel Povey, Sanjeev Khudanpur

    Abstract: A cache-inspired approach is proposed for neural language models (LMs) to improve long-range dependency and better predict rare words from long contexts. This approach is a simpler alternative to attention-based pointer mechanism that enables neural LMs to reproduce words from recent history. Without using attention and mixture structure, the method only involves appending extra tokens that repres…

    Submitted 29 September, 2020; originally announced September 2020.

    Comments: To appear at Interspeech 2020

  29. arXiv:2008.13213  [pdf, other]

    eess.AS

    Mixture of Speaker-type PLDAs for Children's Speech Diarization

    Authors: Jiamin Xie, Suzanna Sia, Paola Garcia, Daniel Povey, Sanjeev Khudanpur

    Abstract: In diarization, the PLDA is typically used to model an inference structure which assumes the variation in speech segments be induced by various speakers. The speaker variation is then learned from the training data. However, human perception can differentiate speakers by age, gender, among other characteristics. In this paper, we investigate a speaker-type informed model that explicitly captures t…

    Submitted 30 August, 2020; originally announced August 2020.

    Comments: submitted to Interspeech 2020

  30. arXiv:2008.02385  [pdf, other]

    cs.CL

    Efficient MDI Adaptation for n-gram Language Models

    Authors: Ruizhe Huang, Ke Li, Ashish Arora, Dan Povey, Sanjeev Khudanpur

    Abstract: This paper presents an efficient algorithm for n-gram language model adaptation under the minimum discrimination information (MDI) principle, where an out-of-domain language model is adapted to satisfy the constraints of marginal probabilities of the in-domain data. The challenge for MDI language model adaptation is its computational complexity. By taking advantage of the backoff structure of n-gr…

    Submitted 5 August, 2020; originally announced August 2020.

    Comments: To appear in INTERSPEECH 2020. Appendix A of this full version will be filled soon

  31. arXiv:2005.10470  [pdf, other]

    eess.AS cs.CL cs.SD

    Multistream CNN for Robust Acoustic Modeling

    Authors: Kyu J. Han, Jing Pan, Venkata Krishna Naveen Tadala, Tao Ma, Dan Povey

    Abstract: This paper proposes multistream CNN, a novel neural network architecture for robust acoustic modeling in speech recognition tasks. The proposed architecture processes input speech with diverse temporal resolutions by applying different dilation rates to convolutional neural networks across multiple streams to achieve the robustness. The dilation rates are selected from the multiples of a sub-sampl…

    Submitted 25 April, 2021; v1 submitted 21 May, 2020; originally announced May 2020.

    Comments: Accepted to ICASSP 2021

  32. arXiv:2005.09824  [pdf, other]

    eess.AS cs.CL cs.SD

    PyChain: A Fully Parallelized PyTorch Implementation of LF-MMI for End-to-End ASR

    Authors: Yiwen Shao, Yiming Wang, Daniel Povey, Sanjeev Khudanpur

    Abstract: We present PyChain, a fully parallelized PyTorch implementation of end-to-end lattice-free maximum mutual information (LF-MMI) training for the so-called chain models in the Kaldi automatic speech recognition (ASR) toolkit. Unlike other PyTorch and Kaldi based ASR toolkits, PyChain is designed to be as flexible and light-weight as possible so that it can be easily plugged into new ASR proje…

    Submitted 19 May, 2020; originally announced May 2020.

    Comments: Submitted to Interspeech 2020

  33. arXiv:2005.08347  [pdf, other]

    eess.AS cs.CL cs.SD

    Wake Word Detection with Alignment-Free Lattice-Free MMI

    Authors: Yiming Wang, Hang Lv, Daniel Povey, Lei Xie, Sanjeev Khudanpur

    Abstract: Always-on spoken language interfaces, e.g. personal digital assistants, rely on a wake word to start processing spoken input. We present novel methods to train a hybrid DNN/HMM wake word detection system from partially labeled training data, and to use it in on-line applications: (i) we remove the prerequisite of frame-level alignments in the LF-MMI training algorithm, permitting the use of un-tra…

    Submitted 28 July, 2020; v1 submitted 17 May, 2020; originally announced May 2020.

    Comments: Accepted at Interspeech 2020. 5 pages, 3 figures

  34. arXiv:2004.09249  [pdf, other]

    cs.SD cs.CL eess.AS

    CHiME-6 Challenge: Tackling Multispeaker Speech Recognition for Unsegmented Recordings

    Authors: Shinji Watanabe, Michael Mandel, Jon Barker, Emmanuel Vincent, Ashish Arora, Xuankai Chang, Sanjeev Khudanpur, Vimal Manohar, Daniel Povey, Desh Raj, David Snyder, Aswin Shanmugam Subramanian, Jan Trmal, Bar Ben Yair, Christoph Boeddeker, Zhaoheng Ni, Yusuke Fujita, Shota Horiguchi, Naoyuki Kanda, Takuya Yoshioka, Neville Ryant

    Abstract: Following the success of the 1st, 2nd, 3rd, 4th and 5th CHiME challenges we organize the 6th CHiME Speech Separation and Recognition Challenge (CHiME-6). The new challenge revisits the previous CHiME-5 challenge and further considers the problem of distant multi-microphone conversational speech diarization and recognition in everyday home environments. Speech material is the same as the previous C…

    Submitted 2 May, 2020; v1 submitted 20 April, 2020; originally announced April 2020.

  35. arXiv:2002.06220  [pdf, other]

    eess.AS cs.SD

    Speaker Diarization with Region Proposal Network

    Authors: Zili Huang, Shinji Watanabe, Yusuke Fujita, Paola Garcia, Yiwen Shao, Daniel Povey, Sanjeev Khudanpur

    Abstract: Speaker diarization is an important pre-processing step for many speech applications, and it aims to solve the "who spoke when" problem. Although the standard diarization systems can achieve satisfactory results in various scenarios, they are composed of several independently-optimized modules and cannot deal with the overlapped speech. In this paper, we propose a novel speaker diarization method:…

    Submitted 14 February, 2020; originally announced February 2020.

    Comments: Accepted to ICASSP 2020

  36. arXiv:1910.10032  [pdf, ps, other]

    cs.CL eess.AS

    GPU-Accelerated Viterbi Exact Lattice Decoder for Batched Online and Offline Speech Recognition

    Authors: Hugo Braun, Justin Luitjens, Ryan Leary, Tim Kaldewey, Daniel Povey

    Abstract: We present an optimized weighted finite-state transducer (WFST) decoder capable of online streaming and offline batch processing of audio using Graphics Processing Units (GPUs). The decoder is efficient in memory utilization, input/output (I/O) bandwidth, and uses a novel Viterbi implementation designed to maximize parallelism. The reduced memory footprint allows the decoder to process significant…

    Submitted 13 February, 2020; v1 submitted 22 October, 2019; originally announced October 2019.

    Comments: Accepted to ICASSP 2020

  37. Probing the Information Encoded in X-vectors

    Authors: Desh Raj, David Snyder, Daniel Povey, Sanjeev Khudanpur

    Abstract: Deep neural network based speaker embeddings, such as x-vectors, have been shown to perform well in text-independent speaker recognition/verification tasks. In this paper, we use simple classifiers to investigate the contents encoded by x-vector embeddings. We probe these embeddings for information related to the speaker, channel, transcription (sentence, words, phones), and meta information about…

    Submitted 30 September, 2019; v1 submitted 13 September, 2019; originally announced September 2019.

    Comments: Accepted at IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU) 2019

    Journal ref: IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) (2019): 726-733

  38. arXiv:1804.03243  [pdf, other]

    cs.CL

    A GPU-based WFST Decoder with Exact Lattice Generation

    Authors: Zhehuai Chen, Justin Luitjens, Hainan Xu, Yiming Wang, Daniel Povey, Sanjeev Khudanpur

    Abstract: We describe initial work on an extension of the Kaldi toolkit that supports weighted finite-state transducer (WFST) decoding on Graphics Processing Units (GPUs). We implement token recombination as an atomic GPU operation in order to fully parallelize the Viterbi beam search, and propose a dynamic load balancing strategy for more efficient token passing scheduling among GPU threads. We also redesi…

    Submitted 27 July, 2018; v1 submitted 9 April, 2018; originally announced April 2018.

    Comments: accepted by INTERSPEECH 2018

    MSC Class: 68T10 ACM Class: I.2.7

  39. arXiv:1706.03747  [pdf, other]

    cs.CL

    Acoustic data-driven lexicon learning based on a greedy pronunciation selection framework

    Authors: Xiaohui Zhang, Vimal Manohar, Daniel Povey, Sanjeev Khudanpur

    Abstract: Speech recognition systems for irregularly-spelled languages like English normally require hand-written pronunciations. In this paper, we describe a system for automatically obtaining pronunciations of words for which pronunciations are not available, but for which transcribed data exists. Our method integrates information from the letter sequence and from the acoustic evidence. The novel aspect o…

    Submitted 12 June, 2017; originally announced June 2017.

  40. arXiv:1510.08484  [pdf, other]

    cs.SD

    MUSAN: A Music, Speech, and Noise Corpus

    Authors: David Snyder, Guoguo Chen, Daniel Povey

    Abstract: This report introduces a new corpus of music, speech, and noise. This dataset is suitable for training models for voice activity detection (VAD) and music/speech discrimination. Our corpus is released under a flexible Creative Commons license. The dataset consists of music from several genres, speech from twelve languages, and a wide assortment of technical and non-technical noises. We demonstrate…

    Submitted 28 October, 2015; originally announced October 2015.

  41. arXiv:1410.7455  [pdf, ps, other]

    cs.NE cs.LG stat.ML

    Parallel training of DNNs with Natural Gradient and Parameter Averaging

    Authors: Daniel Povey, Xiaohui Zhang, Sanjeev Khudanpur

    Abstract: We describe the neural-network training framework used in the Kaldi speech recognition toolkit, which is geared towards training DNNs with large amounts of training data using multiple GPU-equipped or multi-core machines. In order to be as hardware-agnostic as possible, we needed a way to use multiple machines without generating excessive network traffic. Our method is to average the neural networ…

    Submitted 22 June, 2015; v1 submitted 27 October, 2014; originally announced October 2014.

    Comments: Accepted as workshop contribution to ICLR 2015. 12 pages plus 16 pages of appendices, International Conference on Learning Representations (ICLR): Workshop track, 2015. [2 sets of minor fixes post-publication.]

  42. arXiv:1111.4259  [pdf, ps, other]

    stat.ML math.OC

    Krylov Subspace Descent for Deep Learning

    Authors: Oriol Vinyals, Daniel Povey

    Abstract: In this paper, we propose a second order optimization method to learn models where both the dimensionality of the parameter space and the number of training samples is high. In our method, we construct on each iteration a Krylov subspace formed by the gradient and an approximation to the Hessian matrix, and then use a subset of the training data samples to optimize over this subspace. As with the…

    Submitted 17 November, 2011; originally announced November 2011.